Preface


SINCE 1939, THE DATE OF PUBLICATION OF THE FIRST EDITION, 
many important advances have been made in statistics, necessitating a 
very complete revision and the introduction of much new material for 
the present volume. The process has been time-consuming, and new 
techniques and procedures continually being evolved have made it very 
difficult to bring the entire book up to date. To make its contribution 
as valuable as possible in a world of changing statistical techniques, 
there has been as much emphasis as possible on fundamental principles 
and basic theory. This knowledge will always be required by the 
student in order to understand and appreciate new advances. 

A book on statistics must necessarily reflect the author's experience 
and needs in teaching and giving advice to research workers. For this 
reason, if every teacher were to write a book, the books would all be 
different and not one of them would be perfectly satisfactory to all the 
others. In writing a text it is necessary therefore to try to strike an
average with respect not only to subject matter but also to methods of 
presentation and development. The ideas of a number of teaching stat- 
isticians have been obtained, and as many of them as possible have 
been incorporated into the present book. 

The subject matter is slanted rather definitely towards the needs of 
the student who is now, or will eventually be, a research worker. This 
viewpoint was adopted for two reasons. In the first place it is not likely 
that the author would be able to write any other kind of book, and in
the second place it would seem to meet the needs of the majority of 
students of science at the present time. 

In the development of each procedure an attempt has been made 
to follow a uniform method. After general statements the algebraic 
development is given, and this is followed by a completely worked-out 
example. The teacher may find it expedient to have the students work 
through the example before they have fully mastered the principles 
given in the algebraic development. After some familiarity with the 
numerical procedure is achieved, it is often much easier for them to 
grasp the fundamental principles involved. 

Although the book may be used to a considerable extent as a reference
book for methods of analysis, it is not designed entirely for this
purpose. The emphasis is on giving detailed methods of analysis which 
bring out fundamental principles. This means too that the length of a 
given section or chapter is not necessarily related to its utility. The 
chapter on the treatment of non-orthogonal data is an example. It is 
the longest chapter in the book, but it very likely will not be used for 
reference purposes as much as many of the shorter chapters. Together 
with the chapter on the analysis of variance, it aims, however, to give 
the student that knowledge of fundamental principles necessary for a 
full understanding of the principles of experimental design. 

It is difficult to make adequate acknowledgments because of the 
number of people who have at one time or another been of assistance. 
Free use has been made of material in other textbooks, particularly in 
the excellent books by R. A. Fisher, G. W. Snedecor, W. G. Cochran 
and Gertrude M. Cox, W. A. Shewhart, J. G. Smith and A. J. Duncan, 
Paul G. Hoel, M. G. Kendall, D. J. Finney, and K. Mather. Professor 
R. A. Fisher was kind enough to examine and criticize the chapter on 
basic experimental designs, and Mr. D. J. Finney, the chapter on probit 
analysis. Mr. G. B. Oakland and Dr. J. W. Hopkins have constantly 
been consulted on various points. Others have contributed material, 
and Mrs. J. Sweetland and Miss J. Thomas have struggled successfully 
with two or three versions of the typescript. Mr. G. Ballantine has 
checked the figuring on all examples and exercises. 


CYRIL H. GOULDEN
Ottawa, Ontario 
June, 1952 
Introductory Concepts


1. Statistics and the student of science. Statistics is a subject of 
particular interest to the science student. He will find that it has practical 
applications in most of the fields in which he is likely to find his life work, 
and in any event it will help him to a better understanding and apprecia- 
tion of the phenomena of nature with which he is bound to be in close 
contact throughout his career. In the first place, statistics helps the 
student to understand the nature of variability. This is important in 
practical applications because phenomena observed under one set of 
conditions are never duplicated exactly under another set of conditions. 
It is important in research work because the investigator must take into 
account all the variable factors and attempt to design his experiments and 
interpret his results accordingly. It is important to industries because 
the factors causing variability introduce variations in the quality of their 
output. 

In the second place, statistics teaches how to derive general laws from 
a mass of individual determinations that in themselves are meaningless. 
On tracing the pattern made by a moving molecule we find that it darts 
about in varying directions at high velocities. At no particular point can 
it be said that a certain molecule is going to travel in a given direction at 
a given velocity. Its behavior as a single molecule is entirely unpredict- 
able. It is well known, however, that as the temperature rises the mole- 
cules travel at greater velocities. If the substance under consideration is 
a gas within a closed chamber, as the temperature rises the molecules 
bombard the walls of the chamber with increasing force and the pressure 
on the walls increases accordingly. Pressure determinations show that, 
within certain limits, the pressure changes are proportional to temperature. 
In other words, in spite of the vagaries of the movement of individual 
molecules, their numbers are so large that the effects they produce are 
uniform and consistent. This simple example that every student has
observed in the laboratory is an illustration of a statistical law. Analogous 
laws may be derived from experiments although the results may not be 
as consistent as in the above example. A comparison between two seed 
treatments is being made. Because of variations in soil and climate the 
difference between these two treatments with respect to their effect on the
yield of the crop may be inconsistent, but if a real difference exists and a 
sufficient number of comparisons are available this real difference will 
become evident. The result can then be expressed as a general law 
because we have confidence that, if the whole experiment is repeated a 
sufficient number of times, similar results will be obtained. 

One of the fundamental functions of statistics is that of expressing, in 
summary form, facts that are represented by a large number of determin- 
ations. Thus the average or mean is often used as the best single value 
for representing a group. The variability of a group of observations is 
expressed by the value of the standard deviation. A trend such as the 
increase in yield with increasing amounts of fertilizer is expressed first in 
the form of a graph and finally in the form of a regression coefficient. 
This emphasizes the descriptive value of statistics and its power to con- 
dense a large amount of information into a single graph or a single value. 
In dealing with biological phenomena, where variation is a rule rather 
than the exception, it is obvious that this phase of statistics has a wide 
application. 

For the student who contemplates advanced study with the object of 
doing research, the study of statistics will prove a necessity. Some
reasons for this will be clear from what has already been stated, but a 
more pressing reason is that statistical principles are involved in the 
efficient and economic design of experiments as well as in the interpreta- 
tion of the results. It is only within recent years that statisticians have 
been able to make clear, and to have generally accepted, the principle that 
statistical methods are required in the design of experiments. It was a
common occurrence for research workers to consult a statistician after 
the experiment had been conducted and the data tabulated. In many 
cases the results proved of little value, and in many others it was found 
that the experiment could have been performed in an easier way or that it 
could have provided much more information for a given amount of effort. 

Although a large part of this book is taken up with the application of 
statistics to the design and analysis of experiments, it should not be taken 
for granted that this is necessarily the most important field for the applica- 
tion of statistics. The principles of statistics and the various theorems 
with which it deals are common to all applications. Certain specialized 
procedures may be more common in one field of work than another, but 
one never knows when they will be found quite applicable in an entirely 
different field. In the following pages a chapter will be found on the 
application of the statistical principles of quality control to the manu- 
facturing industry. This is an important branch of statistics and one in 
which a number of students may find their greatest interest. 

One phase of statistics deals with the correct methods for taking samples
in order to obtain information on the lot or the population from which 
the sample is drawn. Such sampling problems are universal. They 
apply to business, manufacturing, social studies, economics, land evalua- 
tion, research experiments, medicine, and a host of other subjects. There 
is a similarity, for example, in methods of sampling that might be adopted, 
in one instance in order to conduct a social survey of people in various 
strata of society, and in the other instance in order to determine the 
average number of grasshopper eggs in a square foot of soil in a given field. 

The general conclusion is that statistical principles have a very wide 
application, and it is the study of these principles with which the student 
must be concerned chiefly, rather than techniques of particular value in 
restricted fields. The student of science in particular has interests in a 
wide variety of subjects and therefore has a definite need for a knowledge 
of statistics with special reference to its fundamental principles. 

2. Mathematical ideas are fundamental to statistics. There is a good 
deal of argument concerning the mathematical training a student requires 
before he undertakes a study of statistics. The statistical methods and 
discussions in this book do not require a knowledge of advanced mathe- 
matics, but they do require a solid foundation in the basic mathematics 
such as elementary algebra, geometry, and trigonometry. The algebra 
of probability is particularly important. For purposes of review, a few 
of the important mathematical ideas are discussed below. 

a. Probability. Mathematicians have had lengthy arguments about 
the definition of probability. For our purpose it is convenient to accept 
the definition given by Arley and Buch in their Introduction to the Theory 
of Probability and Statistics. This definition is in reality a concept that 
must be described rather than stated in a few words. Out of a series of
n events suppose that $n_A$ of these are of a certain specified class A. For
example, in driving through farming country each field of grain observed
is an event. Of these fields some are wheat and therefore belong to a
specific class, the others being oats, barley, etc. Of n fields observed, if
$n_A$ are wheat we say that

$$ f(A) = \frac{n_A}{n} \tag{1} $$

is the relative frequency of A in the series of observations. Now, it can
be shown by actual trial in making repeated observations that f(A) is
more consistent as n is increased. For small samples there will be wide
variations, but for large samples the observed values of f(A) will be
closely grouped around some central value. This tendency to grouping
as the number of observations is increased may be referred to as the
random law. The observed values may be interpreted as estimates of
some physical constant which is a fixed or “true” value. Quoting from 
Arley and Buch: “This physical constant, for which the relative frequencies 
are experimental values, is called the probability of the event A . . . and 
is denoted by P(A).” 
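
The grouping of f(A) around a central value can be illustrated by a small simulation. This is only a sketch: the function name `relative_frequency`, the seed, and the assumed "true" proportion of wheat fields (0.4) are illustrative choices, not values from the text.

```python
import random

def relative_frequency(n, p_wheat=0.4, seed=1):
    """Return n_A / n for n simulated fields.

    p_wheat is a hypothetical true probability that a field is wheat.
    """
    rng = random.Random(seed)
    n_a = sum(1 for _ in range(n) if rng.random() < p_wheat)
    return n_a / n

# Small samples vary widely; large samples settle near the assumed value.
for n in (10, 100, 1000, 100000):
    print(n, relative_frequency(n))
```

With a different seed the small-sample values change noticeably, while the value for the largest sample stays close to the assumed probability.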

The above definition would not allow us to find the exact value of any 
probability, but there are circumstances wherein we can deduce probabi- 
lities by making certain postulates about the conditions. For example, 
in throwing a coin it is reasonable to assume that heads and tails are 
equally likely to occur, or more specifically there is no good reason to 
assume that one is more likely to occur than the other. In a large number 
of throws it is to be expected, therefore, that the relative frequency of 
heads will be quite close to 1/2, because approximately one half of the 
throws will give heads and one half tails. In short, the probability of 
heads can be deduced from a simple definition 


$$ P(A) = \frac{n_1}{n_1 + n_2} \tag{2} $$

where $n_1 + n_2$ is all the possible ways in which the event can occur and
$n_1$ is the number of ways in which the specific event A can occur. In this
case $n_1 = 1$, $n_2 = 1$, and P(A) = 1/2.
In statistical theory it is generally true that it is not necessary to evaluate 
a probability with exactitude. As an example of a typical statistical 
problem, consider the testing of a die to determine the extent or existence 
of a bias. If the die is thrown 1200 times and it is a true die, the expectation
is 200 aces. Note that the exact probability of throwing an ace with
this particular die is not required. The probability arises as a result of 
the hypothesis that has been set up that the die is true. The die is then 
thrown a large number of times, and the agreement between the theoretical 
and the actual frequency of the occurrence of aces noted. If the result 
is quite close to the expected result with a true die, the conclusion is that 
at least with respect to this particular trial there is no evidence of bias. 
On the other hand, if the results diverge quite widely from expectation, 
the conclusion is that the die is biased. 
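
As a rough sketch of this comparison, the expected number of aces under the hypothesis of a true die can be set beside a simulated count. The function names and the seed are assumptions made for illustration, and no formal test of significance is attempted here.

```python
import random
from fractions import Fraction

def expected_aces(throws):
    """Expected number of aces if the die is true (probability 1/6 per throw)."""
    return throws * Fraction(1, 6)

def simulate_aces(throws, seed=0):
    """Count aces in simulated throws of a true die (illustrative only)."""
    rng = random.Random(seed)
    return sum(1 for _ in range(throws) if rng.randint(1, 6) == 1)

expected = expected_aces(1200)
observed = simulate_aces(1200)
print("expected:", expected, "observed:", observed, "difference:", observed - expected)
```

A small difference gives no evidence of bias; how large a difference must be before bias is concluded is a question for the tests of significance.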

b. Basic rules of probability. The first of these may be stated very
simply as follows. Let P(A) be the probability of an event A and P(B)
the probability of an event B. Then, if the occurrence of one event
excludes the occurrence of the other, the probability that either A or B
will occur is

$$ P(A, B) = P(A) + P(B) \tag{3} $$


Referring to die problems, A may be the occurrence of an ace, and B, a 5. 
The separate probabilities are P(A) = 1/6 and P(B) = 1/6. In a single
throw if we get an ace the 5 cannot occur, and if we get a 5 the ace cannot
occur. Then the probability of turning up an ace or a 5 is 1/6 + 1/6 = 1/3.
This rule can be extended to cover any number of mutually exclusive 
events. Therefore for 3 events 


$$ P(A, B, C) = P(A) + P(B) + P(C) \tag{4} $$


Applying this rule to the probability of turning up an ace, 2, 3, 4, 5, or 6,
we get

$$ P = \tfrac{1}{6} + \tfrac{1}{6} + \tfrac{1}{6} + \tfrac{1}{6} + \tfrac{1}{6} + \tfrac{1}{6} = 1 $$

which is the obvious result because one or other of the 6 possible events 
must happen. 

The second general proposition covers events that are not mutually
exclusive, as throwing 2 dice and determining the probability of throwing
an ace with the first die and a 5 with the second. Here the 2 events are
independent. If the events are A and B, the probability of their occurring
together is

$$ P(A + B) = P(A) \times P(B) \tag{5} $$

For the dice problem

$$ P(1 + 5) = \tfrac{1}{6} \times \tfrac{1}{6} = \tfrac{1}{36} $$

A variation of this problem is encountered when the second event is 
conditional on the occurrence of the first event. Thus in drawing 2 cards 
from a pack we might consider the probability of drawing 2 aces. Note 
that the probability of drawing the first ace is 4/52, and having drawn the 
ace the probability of drawing a second one is 3/51; therefore the prob- 
ability of drawing 2 aces is 4/52 x 3/51 = 12/2652 = 1/221. The 
general statement can be made for problems of this type that 


$$ P = P(\text{first event happens}) \times P(\text{second event happens, given the first}) $$

or

$$ P = P(A) \times P_A(B) \tag{6} $$

where $P_A(B)$ represents the probability of B conditional on A.
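
The three rules can be checked with exact fractions. The variable names below are illustrative; the numbers are those of the die and card examples above.

```python
from fractions import Fraction

p_ace = Fraction(1, 6)
p_five = Fraction(1, 6)

# Equation (3): mutually exclusive events on a single throw.
p_ace_or_five = p_ace + p_five

# Equation (5): independent events, an ace on one die and a 5 on the other.
p_ace_then_five = p_ace * p_five

# Equation (6): second draw conditional on the first (2 aces from a pack of 52).
p_two_aces = Fraction(4, 52) * Fraction(3, 51)

print(p_ace_or_five, p_ace_then_five, p_two_aces)
```

Working in fractions rather than decimals keeps results such as 1/221 exact.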

c. Permutations and combinations. Taking the 4 letters a, b, c, d, the
permutations of these letters 2 at a time are ab, ba, ac, ca, ad, da, bc, cb,
bd, db, cd, dc. Note that permutations are different if they contain the
same letters but in different order. For n things taken r at a time, the
total number of permutations is given by

$$ {}_nP_r = n(n-1)(n-2) \cdots (n-r+1) \tag{7} $$

If all the n things are to be taken, we have

$$ {}_nP_n = n(n-1)(n-2) \cdots 3 \cdot 2 \cdot 1 \tag{8} $$

The continued product $n(n-1)(n-2) \cdots 3 \cdot 2 \cdot 1$ is known as factorial
n and is indicated by the symbol n!.

When the things to be arranged are not all different, as in determining
the number of permutations of the 7 letters aabbbcc, we note that there
are 2 a's, 3 b's, and 2 c's. Then the total number of permutations is

$$ \frac{7!}{2!\,3!\,2!} $$

If n represents the number of letters and p, q, r, ... the number of each
kind, the general formula for the total number of permutations taken all
together is

$$ P = \frac{n!}{p!\,q!\,r! \cdots} \tag{9} $$


Combinations refer to the groups of things that can be arranged with
reference to kind but not to order. Thus all the combinations of the
4 letters a, b, c, d taken 2 at a time are ab, ac, ad, bc, bd, cd. For n
things taken r at a time, the number of combinations represented by
$\binom{n}{r}$ is

$$ \binom{n}{r} = \frac{n(n-1)(n-2) \cdots (n-r+1)}{r!} = \frac{n!}{r!\,(n-r)!} \tag{10} $$

The total number of combinations of n different things taken 0, 1, 2, ...,
or n at a time is $2^n$.
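
The counting formulas of this section can be verified numerically. The sketch below uses Python's standard `math.perm`, `math.comb`, and `math.factorial`; the helper `multiset_permutations` is a hypothetical name introduced here for equation (9).

```python
from itertools import permutations
from math import comb, factorial, perm

def multiset_permutations(counts):
    """Equation (9): n! / (p! q! r! ...) for the given counts of each kind."""
    total = factorial(sum(counts))
    for count in counts:
        total //= factorial(count)
    return total

# Equation (7): 4 letters taken 2 at a time.
print(perm(4, 2))

# Equation (9): the 7 letters aabbbcc, checked by brute-force enumeration.
print(multiset_permutations([2, 3, 2]))
print(len(set(permutations("aabbbcc"))))

# Equation (10): combinations of 4 letters taken 2 at a time.
print(comb(4, 2))

# Combinations of n things taken 0, 1, ..., n at a time sum to 2**n.
n = 5
print(sum(comb(n, r) for r in range(n + 1)), 2 ** n)
```

The brute-force enumeration is practical only for short strings; the formula is what makes larger cases tractable.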

Permutations and combinations are important in working out problems 
in probability and therefore occur frequently in procedures of mathe- 
matical statistics. A simple problem involving permutations is to assume 
12 chips in a bowl with the chips numbered from 1 to 12. If 2 chips are
drawn at random, what is the probability that the sum of the numbers
on the chips will be 5? Two chips can be drawn from 12 in 12(12 − 2 + 1)
= 12 × 11 = 132 ways, using equation (7) above. The sum of 5 can be obtained
on 2 chips when the numbers are 4 and 1, 3 and 2, 2 and 3, or 1 and 4, 
making a total of 4 ways. The required probability is then 4/132 = 1/33. 
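
The chip problem is small enough to enumerate directly. This sketch lists every ordered draw of 2 chips and counts those that sum to 5; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

# Every ordered draw of 2 distinct chips from chips numbered 1 to 12.
draws = list(permutations(range(1, 13), 2))

# Ordered draws whose numbers sum to 5.
favourable = [draw for draw in draws if sum(draw) == 5]

probability = Fraction(len(favourable), len(draws))
print(len(draws), favourable, probability)
```

Counting unordered pairs instead (2 favourable out of 66) gives the same probability, since order doubles numerator and denominator alike.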

An important proposition involving combinations comes from con- 
sidering such problems as the probability of the occurrence of different 
numbers of heads in tossing n coins. Suppose that 4 coins are thrown. 
The different numbers of heads and tails possible are obviously 4 heads, 
0 tails; 3 heads, 1 tail; 2 heads, 2 tails; 1 head, 3 tails; and 0 heads,
4 tails. To determine the probability of each result it is convenient to
consider the 4 coins arranged in a row occupying the positions numbered
1, 2, 3, 4, as shown below. Beginning with zero heads the total number
of arrangements is $\binom{4}{0} = 1$. One head can occur in any one of the 4
positions or in $\binom{4}{1} = 4$ ways. Similarly, 2 heads can occur in
$\binom{4}{2} = 6$ ways, and so forth for 3 and 4 heads.

    Number of   One of the possible   Number of       Probability
    heads       arrangements          arrangements    P
                (positions 1 2 3 4)
    ---------   -------------------   ------------    -----------
        0            T T T T                1            1/16
        1            H T T T                4            1/4
        2            H H T T                6            3/8
        3            H H H T                4            1/4
        4            H H H H                1            1/16
    ---------   -------------------   ------------    -----------
    Total                                  16            1

The probabilities as given in the last column are then easily obtained.

For any similar problem we can write down the probabilities directly.
The denominator is $2^n$, where n represents the number of coins thrown,
and the numerators are

$$ \binom{n}{0},\ \binom{n}{1},\ \binom{n}{2},\ \ldots,\ \binom{n}{r},\ \ldots,\ \binom{n}{n} $$

or

$$ 1,\ n,\ \frac{n!}{2!\,(n-2)!},\ \ldots,\ \frac{n!}{r!\,(n-r)!},\ \ldots,\ 1 \tag{11} $$
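
The rule just stated, numerators given by the combinations and denominator $2^n$, can be written as a short function. The name `heads_distribution` is an illustrative choice.

```python
from fractions import Fraction
from math import comb

def heads_distribution(n):
    """Probabilities of 0, 1, ..., n heads when n fair coins are thrown."""
    return [Fraction(comb(n, r), 2 ** n) for r in range(n + 1)]

for heads, probability in enumerate(heads_distribution(4)):
    print(heads, probability)
```

For n = 4 this reproduces the last column of the table, and the probabilities sum to 1 for any n.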


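d. Binomial theorem. This theorem is stated usually in one of the
following forms.

$$ (q+p)^n = q^n + \binom{n}{1} q^{n-1} p + \binom{n}{2} q^{n-2} p^2 + \cdots + \binom{n}{r} q^{n-r} p^r + \cdots + p^n \tag{12} $$

$$ (q+p)^n = q^n + n q^{n-1} p + \frac{n(n-1)}{2!} q^{n-2} p^2 + \cdots + \frac{n!}{r!\,(n-r)!} q^{n-r} p^r + \cdots + p^n \tag{13} $$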


This is an extremely useful theorem in probability problems. As a
simple example, suppose that p is the probability of the occurrence of an
event, and q = 1 − p is the probability of its failure. Then the probability
in n trials that the event will occur exactly r times is given by

$$ P(r) = \frac{n!}{r!\,(n-r)!}\, q^{n-r} p^r \tag{14} $$

which is seen to be the general term of the binomial expansion. In other
words the successive terms of the expanded binomial give the probability
in n trials of 0, 1, 2, 3, ..., r, ..., n occurrences of the event.
Referring to die-throwing problems again, the probability of throwing 
an ace is 1/6 and of failing to throw an ace is 5/6. Therefore p = 1/6 
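and q = 5/6. In throwing 6 dice, the probability of throwing exactly
2 aces is

$$ \frac{6!}{2!\,4!} \left(\frac{5}{6}\right)^4 \left(\frac{1}{6}\right)^2 = 0.2009 $$

and the probability of throwing either 0, 1, or 2 aces is

$$ \left(\frac{5}{6}\right)^6 + \frac{6!}{1!\,5!} \left(\frac{5}{6}\right)^5 \left(\frac{1}{6}\right) + \frac{6!}{2!\,4!} \left(\frac{5}{6}\right)^4 \left(\frac{1}{6}\right)^2 $$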

е. Algebraic functions. Every student f statistics should have an 
understanding of the meaning of functions suchas у = 2x? or y = e. 
It is understood here that y takes specific values dependent on the value 
of x, in which case we may state the proposition that y is a function of x, 
y = f (x) being the shorthand method of mathematics for condensing the 
statement. In statistics we deal largely with frequency functions. For 
example, we may have 


$$ y = c\,e^{-u^2} \tag{15} $$


where y is a frequency and is a function of u, and e is the base of the
Naperian system of logarithms. The quantity u may itself be a function
of another quantity x so that equation (15) written out in full might be


= сатен 16 


It is expected for a particular problem that c will be constant, and within
those limits y will be strictly a function of x. In a frequency function,
x will represent the variate, such as yield of grain or weight of an animal, 
and so forth. For a given value of x, the frequency function gives the 
theoretical frequency y. One of our problems in statistics may be to 
choose a frequency function such that it will fit the experimental data. 
More often the frequency function serves as a mathematical model on the 
basis of which tests may be applied to the data of the experiment. 
Functions can of course be represented graphically, and Figure 1-1 
shows the type of graph that would be obtained for equation (15). There
are two important facts represented by such a graph. In the first place,
it shows the values of y for given values of u. For example, at u = −1,
the value of y is 2.4. In the second place, it shows the relation between
areas of different parts lying between the curve and the base line. The
shaded section, for example, shows the area lying between −1 and −∞.
We say −∞ in this case because for this particular equation the curve
never actually touches the base line, but this should not cause any
confusion. For all practical purposes the area beyond u = −4 can be
considered as zero. Representing the total area of the curve by unity,
the area from −∞ to −1 is a proportion of the whole. In mathematical
terms it is referred to as the integral of equation (15) from −∞ to −1.

Figure 1-1. Graph of the function $y = c\,e^{-u^2}$

f. Simultaneous equations. If we have two expressions such as

$$ \begin{aligned} 3x + 5y &= 22 \\ 7x - 4y &= 20 \end{aligned} $$

they are referred to as a pair of simultaneous equations with two
unknowns, x and y. They can be solved by multiplying or dividing the
equations by numbers that will make the coefficient of one of the
unknowns the same in both equations. For example,

$$ \begin{aligned} 7(3x + 5y = 22) &\;\Rightarrow\; 21x + 35y = 154 \\ 3(7x - 4y = 20) &\;\Rightarrow\; 21x - 12y = 60 \end{aligned} $$

Then, subtracting the second from the first, we have

$$ 47y = 94 $$

and

$$ y = 2 $$


Substituting 2 for y in one of the equations will then give a new equation 
which can be solved for x. 
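
The elimination steps can be sketched as a small function. The name `solve_by_elimination` and the (a, b, c) tuple layout for an equation ax + by = c are assumptions made for illustration; the sketch also assumes a unique solution exists and that the first x coefficient is not zero.

```python
from fractions import Fraction

def solve_by_elimination(eq1, eq2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by eliminating x.

    Each equation is given as a tuple (a, b, c). Assumes a1 != 0 and
    that the two equations have exactly one common solution.
    """
    a1, b1, c1 = (Fraction(v) for v in eq1)
    a2, b2, c2 = (Fraction(v) for v in eq2)
    # Multiply eq1 by a2 and eq2 by a1 so the x coefficients agree,
    # then subtract the second from the first to leave y alone.
    y = (a2 * c1 - a1 * c2) / (a2 * b1 - a1 * b2)
    # Substitute y back into the first equation to obtain x.
    x = (c1 - b1 * y) / a1
    return x, y

x, y = solve_by_elimination((3, 5, 22), (7, -4, 20))
print(x, y)
```

For the example in the text this gives 47y = 94 as the intermediate step, and substitution then yields x.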

It is important to remember that a set of simultaneous equations 
cannot be solved completely unless there are as many different equations 
as there are unknowns. We can, however, obtain the relative values of 
the unknowns if we are lacking only one equation and the terms on the 
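right are zero. For example, we have

$$ \begin{aligned} 8x + 5y + 2z &= 0 \\ 11x + 2y + 6z &= 0 \end{aligned} $$

Dividing through by x gives

$$ \begin{aligned} 8 + 5\,\frac{y}{x} + 2\,\frac{z}{x} &= 0 \\ 11 + 2\,\frac{y}{x} + 6\,\frac{z}{x} &= 0 \end{aligned} $$

and, putting $y_1$ and $z_1$ for y/x and z/x, we have

$$ \begin{aligned} 5y_1 + 2z_1 &= -8 \\ 2y_1 + 6z_1 &= -11 \end{aligned} $$

Solving,

$$ \begin{aligned} 15y_1 + 6z_1 &= -24 \\ 2y_1 + 6z_1 &= -11 \\ 13y_1 &= -13 \\ y_1 &= -1 \end{aligned} $$

Substituting −1 for $y_1$ in $5y_1 + 2z_1 = -8$, we get

$$ \begin{aligned} -5 + 2z_1 &= -8 \\ 2z_1 &= -3 \\ z_1 &= -\tfrac{3}{2} \end{aligned} $$

The solution of these equations establishes that y/x = −1 and
z/x = −3/2, and therefore that y/z = y/x × x/z = (−1) × (−2/3) = 2/3.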
In some applications this type of solution may be all that is required. 
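g. Determinants and matrices. Considering the simultaneous equations,

$$ \begin{aligned} a_1 x + b_1 y &= c_1 \\ a_2 x + b_2 y &= c_2 \end{aligned} $$

by solving we get

$$ x = \frac{b_2 c_1 - b_1 c_2}{a_1 b_2 - a_2 b_1}, \qquad y = \frac{a_1 c_2 - a_2 c_1}{a_1 b_2 - a_2 b_1} $$

Notice that the denominator is the same for both fractions. It can be
represented conveniently by

$$ \begin{vmatrix} a_1 & b_1 \\ a_2 & b_2 \end{vmatrix} $$

and in this form is known as a determinant of the second order. It can
always be set up in the more direct form $a_1 b_2 - a_2 b_1$ by adding the product
of the first diagonal (upper left to lower right) and subtracting the
product of the second (lower left to upper right). In determinant form
the solutions are

$$ x = \frac{\begin{vmatrix} c_1 & b_1 \\ c_2 & b_2 \end{vmatrix}}{\begin{vmatrix} a_1 & b_1 \\ a_2 & b_2 \end{vmatrix}}, \qquad y = \frac{\begin{vmatrix} a_1 & c_1 \\ a_2 & c_2 \end{vmatrix}}{\begin{vmatrix} a_1 & b_1 \\ a_2 & b_2 \end{vmatrix}} \tag{17} $$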

The positional relations of the symbols in the determinants with the
corresponding symbols in the original equations are easily noted, from 
which the determinants giving the required unknown can be written down 
directly. 
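
Equation (17) can be sketched in code and tried on the pair 3x + 5y = 22, 7x − 4y = 20 from the preceding subsection. The function names `det2` and `solve_by_determinants` are illustrative, not from the text.

```python
from fractions import Fraction

def det2(a, b, c, d):
    """Second-order determinant with rows (a, b) and (c, d)."""
    return a * d - c * b

def solve_by_determinants(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1, a2*x + b2*y = c2 using equation (17)."""
    denominator = det2(a1, b1, a2, b2)
    if denominator == 0:
        raise ValueError("no unique solution: the determinant is zero")
    # In each numerator the constants replace the coefficients of the unknown.
    x = Fraction(det2(c1, b1, c2, b2), denominator)
    y = Fraction(det2(a1, c1, a2, c2), denominator)
    return x, y

print(solve_by_determinants(3, 5, 22, 7, -4, 20))
```

The zero-determinant check is an addition of this sketch; it covers the case where the two equations are not independent.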

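Third-order determinants arise from equations such as

$$ \begin{aligned} a_1 x + b_1 y + c_1 z &= d_1 \\ a_2 x + b_2 y + c_2 z &= d_2 \\ a_3 x + b_3 y + c_3 z &= d_3 \end{aligned} $$

from which it can be shown that

$$ x = \frac{\begin{vmatrix} d_1 & b_1 & c_1 \\ d_2 & b_2 & c_2 \\ d_3 & b_3 & c_3 \end{vmatrix}}{\begin{vmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \\ a_3 & b_3 & c_3 \end{vmatrix}} \tag{18} $$

These can be written down by rule, for it is obvious that in the numerator
d replaces the coefficient of the unknown and the denominators are all
the same. Thus

$$ y = \frac{\begin{vmatrix} a_1 & d_1 & c_1 \\ a_2 & d_2 & c_2 \\ a_3 & d_3 & c_3 \end{vmatrix}}{\begin{vmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \\ a_3 & b_3 & c_3 \end{vmatrix}}, \qquad z = \frac{\begin{vmatrix} a_1 & b_1 & d_1 \\ a_2 & b_2 & d_2 \\ a_3 & b_3 & d_3 \end{vmatrix}}{\begin{vmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \\ a_3 & b_3 & c_3 \end{vmatrix}} \tag{19} $$
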


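A third-order determinant can also be solved by a simple rule. It can
be written as below with the first two columns repeated. The diagonals
running downward to the right are then added and the diagonals running
upward to the right subtracted.

$$ \begin{matrix} a_1 & b_1 & c_1 & a_1 & b_1 \\ a_2 & b_2 & c_2 & a_2 & b_2 \\ a_3 & b_3 & c_3 & a_3 & b_3 \end{matrix} \;=\; a_1 b_2 c_3 + b_1 c_2 a_3 + c_1 a_2 b_3 - a_3 b_2 c_1 - b_3 c_2 a_1 - c_3 a_2 b_1 \tag{20} $$
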


Determinants of any order exist, but those of the fourth and higher 
orders cannot be solved by the simple rules that serve for the second and 
third order (see appendix). 
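
The diagonal rule for a third-order determinant can be written out term by term. This is a sketch with an illustrative function name, `det3`, and hypothetical test values; as the text notes, the rule does not carry over to the fourth and higher orders.

```python
def det3(rows):
    """Third-order determinant by the diagonal rule (first two columns repeated)."""
    (a1, b1, c1), (a2, b2, c2), (a3, b3, c3) = rows
    # Diagonals running downward to the right are added.
    added = a1 * b2 * c3 + b1 * c2 * a3 + c1 * a2 * b3
    # Diagonals running upward to the right are subtracted.
    subtracted = a3 * b2 * c1 + b3 * c2 * a1 + c3 * a2 * b1
    return added - subtracted

print(det3([(1, 0, 0), (0, 1, 0), (0, 0, 1)]))
print(det3([(1, 2, 3), (4, 5, 6), (7, 8, 10)]))
```

The same function supplies the numerators and the denominator when three simultaneous equations are solved by determinants.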

A set of symbols arranged in the form

$$ \begin{pmatrix} a_1 & b_1 & c_1 & d_1 \\ a_2 & b_2 & c_2 & d_2 \\ a_3 & b_3 & c_3 & d_3 \\ a_4 & b_4 & c_4 & d_4 \end{pmatrix} $$


is known as a matrix. The term refers merely to a set of symbols arranged 
in rows and columns and is denoted as a matrix by the large parentheses 
or by double lines. Also, the number of rows is not necessarily equal to 
the number of columns. It is very convenient in mathematical writing to 
be able to refer to a set of symbols as a matrix and for certain systematic 
arrangements to be able then to represent the whole matrix by one 
symbol. It must be noted that even when a matrix is square it is not a 
determinant. However, a square matrix may determine a determinant. 


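Thus for the matrix

$$ \begin{pmatrix} a_1 & b_1 \\ a_2 & b_2 \end{pmatrix} $$

the corresponding determinant is

$$ \begin{vmatrix} a_1 & b_1 \\ a_2 & b_2 \end{vmatrix} = a_1 b_2 - a_2 b_1 $$
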
3. Exercises.


1. Give three examples wherein the making of a single measurement supplies very
little useful information. 

2. Calculate the probabilities: 

(a) That in a single throw a coin will turn up “heads.” 

(b) That in throwing 2 coins they will both turn up “heads.” 

3. Show how you would conduct an experiment to get some idea of the "trueness"
of a coin with respect to turning up “heads” and “tails.” 

4. Calculate the probability of obtaining the following cards in drawing 3 cards 
from a complete pack of 52. 

(a) An ace, a jack, and a deuce. 

(b) Ace, jack, and 2 of spades. 

5. In drawing 5 cards from one pack what is the probability of drawing the ace, 
king, queen, jack, and 10 of one suit?

6. Find the following permutations and combinations. 

(a) Permutations of 15 things, 3 at a time. 

(b) Permutations of 6 things, 6 at a time.

(c) Combinations of 15 things, 3 at a time. 

(d) Combinations of 5 things, 5 at a time. 

7. In how many different ways can a committee of 5 men be made up from a group
of 16? 

8. In throwing 8 coins at once, the number turning up heads may be 0, 1, 2, . . ., 8. 
Determine the probability of throwing 0, 1, 2, or 3 heads.


CHAPTER 2


Variation, Statistics, and 
the Frequency Table 


1. Variation. The need for the statistical method arises from the 
variability of the material with which we have to deal. Illustrations of 
variation are abundant in the biological field. Everyone is familiar with 
the fact that not all the apples on a tree are the same size, that not every 
acre of grain produces the same number of bushels, that not all men are 
the same height, that not all animals of the same group gain equally on 
equal amounts of food, and so forth. It is possible to cite a great many
other examples of a similar nature. 

Since variability exists, it is obvious that a problem arises in attempting 
to define the characteristics of groups. It could be said that, since the 
characteristics to be measured are subject to variability, it is not possible 
to establish figures that are of any value in describing the group; but 
that is clearly absurd, as such measures are continually being employed 
with considerable confidence. It is the purpose of statistics to define 
the best methods of describing the characteristics of groups of individuals 
and to demonstrate how such descriptions can be subjected to objec- 
tive tests of their reliability. Generally speaking, the method is to 
prepare tables and graphs, to calculate statistics, and to apply tests of 
significance. 

2. Populations. It is difficult to define exactly what is meant by the 
term population in the statistical sense, but it is somewhat easier to give 
examples. If we were trying to determine the average income of the 
male citizens of Canada, the population we have in mind is strictly 
speaking a population of incomes and not a population of male citizens. 
Such populations are referred to as finite populations as distinguished 
from infinite populations which form the background of our thinking in 
the development of mathematical statistics. To quote R. A. Fisher [2]: 
“The idea of infinite populations distributed in a frequency distribution 
in respect of one or more characters is fundamental to all statistical 
work." 

One of the principal objectives of statistics is to draw inferences with
respect to populations by the study of groups of individuals drawn at
random that are referred to as samples. 

3. Samples. As pointed out above, the purpose of drawing samples is 
to obtain information about the populations from which they are drawn. 
This information will be of real value only if the sample is drawn in such 
a way that the results obtained are unbiased. In the usual course of 
events this is ensured by drawing the sample at random; i.e., in making
up the sample each individual in the population has an equal chance of 
being included. The size of the sample varies, depending on expense, 
convenience, accuracy required, and many other factors. In fact, 
sampling is the subject of much discussion in statistical literature, and an 
excellent book on the whole subject has been written by W. Edwards 
Deming [1] to which the student should refer for a detailed discussion of 
many sampling problems. 

4. Arithmetic mean. The arithmetic mean is our first example of a 
statistic. It is called a statistic because it is calculated from a sample 
and for our purpose is used to estimate the mean of the population from 
which the sample is drawn. It is necessary in studying statistics to keep 
in mind this important proposition with respect to all statistics calculated 
from samples. Thus, the yield of a variety of wheat as taken from a 
number of plots in an experimental field is not of particular interest in 
itself as it is merely an estimate of what the variety would yield if grown 
over a large area under similar conditions. 

For a sample of N variates where $X_i$ represents any one variate, the 
mean $\bar{X}$ is given by 

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_N}{N}$$

which for the sake of abbreviation is written 

$$\bar{X} = \frac{\sum_{i=1}^{N} (X_i)}{N}$$

Applying the short formula means simply that the summation of the 
quantities is understood and that we do not need to write them all out 
and connect them with plus signs. The expression $\sum_{i=1}^{N} (X_i)$ shows that 
N values are summated; but the simpler form $\sum(X)$ may be used when 
the number of values summed is obvious. 
One of the most interesting properties of the mean is that the sum of 
the deviations of all the individual variates from the mean is zero. This 
may be shown as follows, with $X_i$ representing an individual variate, and 
$x_i = (X_i - \bar{X})$ an individual deviation. 

$$\begin{aligned}
\sum(x) &= (X_1 - \bar{X}) + (X_2 - \bar{X}) + (X_3 - \bar{X}) + \cdots + (X_N - \bar{X}) \\
&= (X_1 + X_2 + X_3 + \cdots + X_N) - N\bar{X}
\end{aligned}$$

and since 

$$X_1 + X_2 + X_3 + \cdots + X_N = N\bar{X}$$

it is clear that $\sum(x) = 0$. Using the summation sign in order to shorten 
the algebra, 

$$\sum(x_i) = \sum(X_i - \bar{X}) = \sum(X_i) - N\bar{X}$$

and since 

$$N\bar{X} = \frac{N\sum(X_i)}{N}$$

it is again clear that $\sum(x_i) = 0$. 
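
The property is easy to check numerically. The following minimal sketch in Python uses five hypothetical plot yields (illustrative values, not data from the text); the variable names are likewise assumptions made for the illustration.

```python
# Hypothetical plot yields (illustrative values only, not from the text).
yields = [31.2, 28.7, 30.1, 33.4, 29.6]

n = len(yields)
mean = sum(yields) / n                     # X-bar = sum(X) / N
deviations = [x - mean for x in yields]    # x_i = X_i - X-bar
sum_of_deviations = sum(deviations)

print(f"mean = {mean:.2f}")
print(f"sum of deviations = {sum_of_deviations:.10f}")
```

In floating-point arithmetic the sum may differ from zero by a very small rounding error, which is why it is printed to a fixed number of decimals rather than compared with zero exactly.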

Another interesting property of the mean is that the sum of the squares 
of the deviations from the mean, $\sum(X_i - \bar{X})^2 = \sum(x_i^2)$, is less than the 
sum of the squares of the deviations from any other value. This can be 
stated formally as follows. 

$$\sum(X_i - M)^2 = \sum(X_i - \bar{X})^2 + N(\bar{X} - M)^2 \tag{3}$$

Thus, if M is substituted for $\bar{X}$, the sum of the squares of the deviations 
is increased by $N(\bar{X} - M)^2$.* 
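
As a numerical illustration of (3), the sketch below (Python, with hypothetical variates and a helper name, `sum_sq_about`, chosen only for this example) compares the two sides of the identity for several trial values of M.

```python
def sum_sq_about(values, m):
    """Sum of squared deviations of the values from an arbitrary point m."""
    return sum((x - m) ** 2 for x in values)

values = [4.0, 7.0, 9.0, 10.0, 15.0]   # hypothetical variates
n = len(values)
mean = sum(values) / n                  # 9.0

for m in (5.0, 9.0, 12.5):
    lhs = sum_sq_about(values, m)                              # sum (X_i - M)^2
    rhs = sum_sq_about(values, mean) + n * (mean - m) ** 2     # right side of (3)
    print(f"M = {m:5.1f}   left = {lhs:7.2f}   right = {rhs:7.2f}")
```

The smallest value is obtained when M equals the mean, since the added term N(X̄ − M)² can never be negative.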

* Let $M = \bar{X} + A$. Then 

$$\begin{aligned}
\sum(X_i - M)^2 &= \sum(X_i - \bar{X} - A)^2 = \sum[(X_i - \bar{X}) - A]^2 \\
&= \sum(X_i - \bar{X})^2 - 2A\sum(X_i - \bar{X}) + NA^2
\end{aligned}$$

From (2) above it is known that $\sum(X_i - \bar{X}) = 0$, and in the expression $2A\sum(X_i - \bar{X})$ 
each deviation is multiplied by 2A, which is constant, and consequently the entire 
expression equals zero. Finally 

$$\begin{aligned}
\sum(X_i - M)^2 &= \sum(X_i - \bar{X})^2 + NA^2 \\
&= \sum(X_i - \bar{X})^2 + N(\bar{X} - M)^2
\end{aligned}$$

which is equivalent to (3) in the text. 

5. Standard deviation. In taking the mean of a sample to represent the 
sample as a whole, it should be clear that the reliability of this mean will 
depend on the degree of variation among the individual variates that make 
up the sample. If there is no variation, the mean represents the whole 
sample perfectly; but, as the variation becomes greater, the single value 
of the mean is less and less descriptive of the entire group, and it becomes 
necessary, in order to describe the sample more completely, to have some 
measure of variability. The average deviation from the mean might 
suggest itself, but the sum of the deviations from the mean is zero 
(Section 4), and from this it follows that the average of the deviations is 
also zero. We cannot deal with a statistic that is always zero, and the 
only alternative is to disregard signs and take the mean of the absolute 
deviations, but a statistic of this type does not lend itself readily to 
algebraic manipulation. 

Other measures of variability that might be selected could be based on 
the values of $\sum(X_i - \bar{X})^2$, $\sum(X_i - \bar{X})^3$, $\sum(X_i - \bar{X})^4$, and so forth. The 
even powers would always be positive, and the odd powers could be 
positive, zero, or negative. In making the selection the first point to 
note is that, the higher the power of the deviations, the greater the effect 
of individual outlying deviations. When this is taken into consideration 
together with the fact that it is not desirable to complicate the computations 
any more than is necessary, the value of $\sum(X_i - \bar{X})^2$ has been 
universally selected as the basis of a standard measure of variability. 
This is the standard deviation represented by the symbol $s'$, given by 

$$s' = \sqrt{\frac{\sum(X_i - \bar{X})^2}{N}} = \sqrt{\frac{\sum(x^2)}{N}} \tag{4}$$

The direct method of calculating the standard deviation is to determine 
all the deviations from the mean, square them, summate, divide by N, 
and then extract the square root. When there are several variates in the 
sample and especially when the deviations contain decimal figures, a 
much shorter method is available. The main part of the work is to find 
the sum of squares of the deviations, and it can be shown very easily that 

$$\sum(x^2) = \sum(X^2) - \frac{(\sum X)^2}{N} \tag{5}$$

This formula is particularly applicable in machine calculation and is now 
employed almost exclusively in statistical laboratories. 
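
The agreement between the direct method and the short method of (5) can be verified with a few lines of Python. This is only a sketch on hypothetical variates; the names `direct` and `short` are chosen for the illustration.

```python
import math

values = [12.4, 15.1, 9.8, 13.3, 11.9, 14.5]   # hypothetical variates
n = len(values)
mean = sum(values) / n

# Direct method: square each deviation from the mean and summate.
direct = sum((x - mean) ** 2 for x in values)

# Short method, identity (5): sum(X^2) - (sum X)^2 / N.
short = sum(x * x for x in values) - sum(values) ** 2 / n

s_prime = math.sqrt(direct / n)   # root-mean-square deviation, divisor N as in (4)

print(f"direct = {direct:.6f}, short = {short:.6f}, s' = {s_prime:.4f}")
```

On a computer the short method subtracts two large, nearly equal numbers when the mean is large relative to the spread, so some precision can be lost; for hand or desk-machine work of the kind described in the text this is seldom a concern.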

It is necessary here to consider a point that is very important in the 
practical application of statistical methods. The mean of a sample may 
be taken as the best possible estimate of the mean of the population from 
which the sample is drawn. The reason for this is that the mean is not 
a biased statistic. It is the best estimate of the population mean in the 
sense that, if we have a large group of means from samples of equal size 
and determine their average, this value will be the same as if we determined 
the mean of the whole collection of individuals without regard to 
their arrangement into groups. Now, for the population being sampled 
it follows that the mean and the standard deviation for this population 
are fixed values and hence may be called parameters. If the mean of the 
parent population is denoted by $\mu$, then $\bar{X}$, the mean of the sample, is 
an unbiased estimate of the parameter $\mu$. Similarly, denoting the 
standard deviation of the population by $\sigma$, what is calculated from the 
sample should be an unbiased estimate of this parameter. Actually this 
estimate is not the root-mean-square deviation defined above. This 
arises in part from the fact that, if $\mu$ is the mean of the population, the 
unbiased estimate of $\sigma^2$ is given by 

$$\frac{\sum(X_i - \mu)^2}{N} \tag{6}$$

but $\mu$ is not known, and we must substitute $\bar{X}$, the mean of the sample. 

From (3) above it is obvious that 

$$\frac{\sum(X - \bar{X})^2}{N} = \frac{\sum(X - \mu)^2}{N} - (\bar{X} - \mu)^2 \tag{7}$$

In other words $s'^2$ will be smaller than $\sum(X - \mu)^2/N$ by a quantity equal 
to $(\bar{X} - \mu)^2$. Further algebraic development of this idea leads to the 
proof that an unbiased estimate of $\sigma^2$ is given by 

$$s^2 = \frac{\sum(X_i - \bar{X})^2}{N - 1} \tag{8}$$

A simple proof of this point based on mathematical expectation is given 
by Hoel [3]. 
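It might be assumed that an unbiased estimate of $\sigma$ would then be given by 

$$s = \sqrt{\frac{\sum(X_i - \bar{X})^2}{N - 1}} \tag{9}$$

but in actual fact this has a slight bias, and this bias can be removed 
completely only if we know that the population is normally distributed 
as defined in Chapter 3. The formulas for giving unbiased estimates of 
$\sigma$ are 

$$\hat{\sigma} = s' \sqrt{\frac{N}{2}}\, \frac{\Gamma\!\left(\frac{N-1}{2}\right)}{\Gamma\!\left(\frac{N}{2}\right)} \quad \text{or} \quad \hat{\sigma} = s \sqrt{\frac{N-1}{2}}\, \frac{\Gamma\!\left(\frac{N-1}{2}\right)}{\Gamma\!\left(\frac{N}{2}\right)} \tag{10}$$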



where $\hat{\sigma}$ means the estimated value of $\sigma$. Actually we frequently employ 
$s$ as an unbiased estimate of $\sigma$ because the bias is generally not large 
enough to be important. 

It should be noted that the symbolism adopted here restricts the Greek 
letters to population parameters and the Latin letters to statistics that 
are estimates of the parameters. This is followed throughout where 
possible. 

The divisor (N − 1) in (8) above is known as the number of degrees 
of freedom (DF) in the sample that are available for estimating the 
standard deviation. The N degrees of freedom originally available in 
the sample are reduced by 1 because 1 statistic, $\bar{X}$, was itself calculated 
from the sample and applied in determining the estimate of the standard 
deviation. 
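
The effect of the divisor can be checked by brute force rather than by algebra. The sketch below is an illustration under stated assumptions, not the proof referred to in the text: it takes a small hypothetical population of four values, enumerates every possible ordered sample of three drawn with replacement (which mimics sampling from an infinite population), and averages the two candidate estimates of the variance over all samples.

```python
from itertools import product

def sum_sq_dev(values):
    """Sum of squared deviations of the values from their own mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)

population = [2, 4, 6, 8]                               # hypothetical population
sigma_sq = sum_sq_dev(population) / len(population)     # population variance = 5.0

n = 3
# All 4**3 = 64 equally likely ordered samples drawn with replacement.
samples = list(product(population, repeat=n))

avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(f"population variance        = {sigma_sq:.4f}")
print(f"average with divisor N     = {avg_div_n:.4f}")
print(f"average with divisor N - 1 = {avg_div_n_minus_1:.4f}")
```

With the divisor N the average falls short of the population variance by the factor (N − 1)/N, whereas with the divisor N − 1 the average reproduces it.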

6. Standard error of a mean. It is obvious that sample means will vary 
from sample to sample, and the degree of variation will be related to the 
degree of variation among the individual variates. The general relation 
between the variation of individual variates and that of means of samples 
is given by 

$$s_{\bar{x}} = \frac{s}{\sqrt{N}} \tag{11}$$

where $s_{\bar{x}}$ is the standard error of the mean of a sample, $s$ is the standard 
deviation for the sample as a whole, and N is the number in the sample. 
In other words, if $s$ is taken as the estimate of the standard deviation of 
the parent population, $s/\sqrt{N}$ is the estimate of the standard deviation of 
a population of means of samples of size N drawn from the same population. 
The standard error of a mean is therefore inversely proportional 
to the square root of the number in the sample. 
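
Formula (11) may be sketched in Python as follows. The function name `standard_error` and the five variates are hypothetical, chosen so that the arithmetic can be followed by hand.

```python
import math

def standard_error(values):
    """Standard error of the mean: s / sqrt(N), with s computed on N - 1 degrees of freedom."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return s / math.sqrt(n)

values = [10, 12, 14, 16, 18]        # hypothetical variates; mean 14, sum of squares 40
se = standard_error(values)           # s^2 = 40/4 = 10, so se = sqrt(10/5)
print(f"standard error = {se:.4f}")

# For a fixed s, quadrupling the number in the sample halves the standard error.
s = 3.0
print(s / math.sqrt(25), s / math.sqrt(100))
```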

7. Variance. The term variance is now used very extensively in connection 
with the statistical analysis of the results from experiments. The 
student should therefore be familiar with it from the first. The variance 
of a population is merely the square of the standard deviation and therefore 
would ordinarily be represented by the symbol $\sigma^2$. Similarly, the 
unbiased estimate of $\sigma^2$, commonly referred to as the mean square 
deviation, would be represented by $s^2$. 

Like the standard deviation the variance is a measure of variability 
but its particular value in statistical analysis is discussed in detail in 
Chapter 5, The Analysis of Variance. 

8. Frequency table. The frequency table shows, for the sample studied, 
the frequencies with which the variates fall into certain clearly defined 
classes. If the sample is small the frequency table will ordinarily not be 
necessary, but for large or moderately large samples it is usually desirable 
to begin the reduction of the data with a table of this kind. The frequency 
table provides the values for easy graphical representation, and from it 
such statistics as the mean and standard deviation may be calculated with 
much greater ease than from the original set of individual values. 

9. Setting up a frequency table. The first task is to decide on the class 
values. This will depend on the accuracy required in the computation 
of statistics from the table, the range of variation (which is, of course, the 
difference between the lowest and highest value of the sample), the number 
in the sample or total frequency, and the facility with which these classes 
can be handled in computation. In the first place, the greater the number 
of classes, the greater the accuracy of the calculations made from the 
table. But there is a limit to the number of classes we can handle conveniently, 
and these two opposing factors must be balanced. A good 
general rule is to make the class interval not more than one-quarter of the 
standard deviation. Of course we do not as a rule know the standard 
deviation before the table is drawn up, but it is possible to make a rough 
estimate of its value from the range of variation. Tippett [5] has pub- 
lished detailed tables on the relation between the range of variation and 
the standard deviation, and these have been summarized in a short table 
prepared by Snedecor [4]. The values in Table 2-1 were taken from 
Snedecor's table after rounding off the figures to two significant digits. 


TABLE 2-1 

VALUES OF RANGE DIVIDED BY THE STANDARD DEVIATION FOR 
SAMPLE SIZES FROM 20 TO 1000 

Number in Sample    Range/σ    Number in Sample    Range/σ 
       20             3.7            200             5.5 
       30             4.1            300             5.8 
       50             4.5            400             5.9 
       75             4.8            500             6.1 
      100             5.0            700             6.3 
      150             5.3           1000             6.5 


Suppose that we have a sample of 500 and the range of variation is 0.25 
to 2.63 = 2.38. The expected standard deviation will then be 2.38/6.1, 
and a suitable interval will be ¼(2.38/6.1) = 0.098. It is more convenient 
to have an odd number for a class interval than an even one, 
since then the mid-point of the interval does not require one more decimal 
place than we have in the values that define the class range. In the 
example above we should probably decide on an interval of 0.11. 
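
The rule of thumb just described can be expressed as a short Python sketch. The dictionary holds the ratios of Table 2-1; the function names `rough_sd` and `max_class_interval` are assumptions introduced only for this illustration, and only the sample sizes listed in the table are handled.

```python
# Ratio of range to standard deviation, as tabulated in Table 2-1.
RANGE_OVER_SIGMA = {
    20: 3.7, 30: 4.1, 50: 4.5, 75: 4.8, 100: 5.0, 150: 5.3,
    200: 5.5, 300: 5.8, 400: 5.9, 500: 6.1, 700: 6.3, 1000: 6.5,
}

def rough_sd(lowest, highest, n):
    """Rough standard deviation estimated from the range of a sample of size n."""
    return (highest - lowest) / RANGE_OVER_SIGMA[n]

def max_class_interval(lowest, highest, n):
    """Largest class interval under the one-quarter-of-the-standard-deviation rule."""
    return rough_sd(lowest, highest, n) / 4

print(round(rough_sd(0.25, 2.63, 500), 2))             # rough standard deviation
print(round(max_class_interval(0.25, 2.63, 500), 3))   # suggested upper limit for the interval
```

The result is only a guide; the final choice of interval still rests on convenience, as the text explains.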
In making up the classes it is usual to begin with the lower boundary 
of the first class slightly below the lowest value, so that the classes and 
mid-points would finally be set up somewhat as follows. 


                 Class Value, or Mid-Point 
Class Range      of Class Range 
0.19-0.29        0.24 
0.30-0.40        0.35 
0.41-0.51        0.46 
0.52-0.62        0.57 
etc.             etc. 


These class ranges assume that values such as 0.296 having an additional 
decimal place will not occur. If 0.296 should occur, it would be put in 
the class 0.30 to 0.40. 

By following the above rules we ensure a sufficient degree of accuracy 
in any statistics that are calculated from the frequency table; but, if the 
table is required mainly for the preparation of a graph as described below, 
this method may give classes that are too small, in that some of the classes 
may have very low frequencies or none at all. It is desirable then to 
make the class interval from one-half to one-third of the standard 
deviation.* 

Sorting is greatly facilitated by writing the value of each variate on 
cards of a convenient size for handling. If the class ranges are written 
out on cards and arranged in order on a table, the sorting can then be 
done rapidly, and on completing the distribution it is easy to run through 
the piles and obtain a check on the work. It is important to have perfect 
accuracy at this point, as a misplaced card may give a great deal of trouble 
at a later stage. The frequency table is made up finally by entering the 
frequencies opposite the corresponding class values. 

* In statistical literature one may encounter references to Sheppard's corrections for 
grouping. These are designed to remove bias from certain statistics that are calculated 
from grouped data instead of from the individual values. Thus in calculating 
$\sum(X - \bar{X})^2/N$ it has been shown that the bias is positive and equal approximately to 
1/12 of the square of the class interval. In the tests for abnormality described in 
Chapter 3 and in certain other specific calculations it is preferable to make the adjustments, 
but in general practice they are ignored and in many tests of significance it is 
best to omit them altogether. 

It is important to note that Sheppard's corrections are for removing a definite bias 
and that they do not make up for general inaccuracies introduced by having groups 
that are too large. Also they do not remove the bias completely but tend to rectify 
the results. 

Table 2-2 is a sample of a frequency table. It represents data on the 
areas of 500 bull sperms. The areas are given in arbitrary units. The 
standard deviation for this sample is 2.58; therefore in order to have a 
high degree of accuracy in the calculations the class interval should have 
been about 0.6. However, this additional accuracy was not required, 
and the class interval selected was 1.0. Note the agreement between the 
range and the standard deviation as given in Table 2-1. Here the range 
is approximately 0.6 to 16.5 = 15.9. From Table 2-1 the standard 
deviation should be approximately 15.9/6.1 = 2.61. 


TABLE 2-2 

FREQUENCY TABLE FOR AREAS IN ARBITRARY UNITS OF 
500 BULL SPERMS 

Class Range    Frequency    Class Range    Frequency 
 0.6-1.5           5          8.6-9.5         69 
 1.6-2.5           6          9.6-10.5        54 
 2.6-3.5          10         10.6-11.5        35 
 3.6-4.5          25         11.6-12.5        22 
 4.6-5.5          44         12.6-13.5        12 
 5.6-6.5          61         13.6-14.5         4 
 6.6-7.5          68         14.6-15.5         1 
 7.6-8.5          83         15.6-16.5         1 

10. Graphical representation of a frequency table. Frequency tables 
are usually represented by graphs of two types. The more common one 
is the histogram. It is a diagrammatic presentation of a frequency table 
in which the class values are represented on the horizontal axis and the 
frequencies by vertical columns erected in their appropriate positions on 
the horizontal axis. As an example the histogram for the data of Table 
2-2 is shown in Figure 2-1. 

FIGURE 2-1. Histogram for areas of 500 bull sperms. (Horizontal axis: 
mid-points of classes; vertical axis: frequency.) 

The other type of graph is referred to as a frequency polygon. A 
straight line is erected for each frequency at the mid-point of the corres- 
ponding class value, and the ends of these connected in series by straight 
lines. It does not give so accurate a picture as the histogram but may be 
better than the histogram for a comparison of two distributions on the 
same graph. 

11. Calculation of the mean and standard deviation from a frequency 
table. The first step is to add two more columns to the frequency table, 
as indicated in the small example given below. The mean is then given by 

$$\bar{X} = \frac{\sum(f_i X_i)}{N} \tag{12}$$

and the standard deviation by 

$$s = \sqrt{\frac{\sum(f_i X_i^2) - [\sum(f_i X_i)]^2/N}{N - 1}} \tag{13}$$

                               Frequency Multiplied    Frequency Multiplied by 
Class Value    Frequency       by Class Value          Square of Class Value 
   X_i            f_i              f_i X_i                  f_i X_i² 
    1              2                  2                        2 
    2              4                  8                       16 
    3              7                 21                       63 
    4              6                 24                       96 
    5              1                  5                       25 
  Total         20 = N         60 = Σ(f_i X_i)         202 = Σ(f_i X_i²) 


It will be remembered that the numerator of the standard deviation 
contains $\sum(x^2)$, and it has been obtained here from the identity given 
in (5). 
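
The small example above can be worked through in Python as a check on the column totals and on formulas (12) and (13). This is a minimal sketch; the variable names are chosen for the illustration.

```python
import math

class_values = [1, 2, 3, 4, 5]
frequencies = [2, 4, 7, 6, 1]

n = sum(frequencies)                                                   # N
sum_fx = sum(f * x for f, x in zip(frequencies, class_values))        # sum of f_i X_i
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, class_values))   # sum of f_i X_i^2

mean = sum_fx / n                         # formula (12)
sum_sq = sum_fx2 - sum_fx ** 2 / n        # identity (5) applied to grouped data
s = math.sqrt(sum_sq / (n - 1))           # formula (13)

print(n, sum_fx, sum_fx2)
print(f"mean = {mean:.2f}, s = {s:.3f}")
```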

Very frequently the class values are numbers containing 2 to 4 digits, 
in which case a great deal of labor can be saved by replacing them for 
calculation by the series of natural numbers 1, 2, 3, 4, ..., etc. The 
calculations then give a mean and a standard deviation that may be 
represented by $\bar{X}''$ and $s''$, respectively. These can be converted into the 
true values by means of the following identities. 

$$\bar{X} = (\bar{X}'' - 1)c + X_1 \tag{14}$$

$$s = c\,s'' \tag{15}$$

where c is the true class interval and $X_1$ is the first true class value. 
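
As an illustration of (14) and (15), the following Python sketch codes the sixteen classes of Table 2-2 as 1 to 16 and converts the coded results back to true values. It assumes that the first true class value is 1.05, the mid-point of the class 0.6 to 1.5, and that the class interval is 1; the variable names are hypothetical.

```python
import math

# Frequencies of Table 2-2, classes 0.6-1.5 through 15.6-16.5.
frequencies = [5, 6, 10, 25, 44, 61, 68, 83, 69, 54, 35, 22, 12, 4, 1, 1]
codes = list(range(1, len(frequencies) + 1))   # natural numbers 1, 2, ..., 16

c = 1.0                     # true class interval
first_class_value = 1.05    # assumed mid-point of the first class, 0.6 to 1.5

n = sum(frequencies)
sum_fu = sum(f * u for f, u in zip(frequencies, codes))
sum_fu2 = sum(f * u * u for f, u in zip(frequencies, codes))

coded_mean = sum_fu / n
coded_s = math.sqrt((sum_fu2 - sum_fu ** 2 / n) / (n - 1))

true_mean = (coded_mean - 1) * c + first_class_value   # formula (14)
true_s = c * coded_s                                    # formula (15)

print(f"coded mean = {coded_mean:.3f}, true mean = {true_mean:.3f}")
print(f"coded s = {coded_s:.3f}, true s = {true_s:.3f}")
```

Because the class interval here is 1, the coded and true standard deviations coincide; with any other interval they would differ by the factor c.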

When the number of classes is fairly large, the work can be further 
reduced by working with an arbitrary origin that is close to the mean. 
The table can be set up as follows, where actual numerical classes are 
selected for illustration. 

 X_i     u_i     f_i     f_i u_i        f_i u_i² 
  18     -2 
  21     -1 
  24      0 
  27      1 
  30      2 
                  N      Σ(f_i u_i)     Σ(f_i u_i²) 

Here $X_0$, the arbitrary origin, = 24. Note that 

$$X_i = c\,u_i + X_0$$

Then 

$$\bar{X} = c\,\bar{u} + X_0 \tag{16}$$

where $\bar{u} = \sum(f_i u_i)/N$. The standard deviation is given by 

$$s = c\sqrt{\frac{\sum f_i (u_i - \bar{u})^2}{N - 1}} \tag{17}$$

where 

$$\sum f_i (u_i - \bar{u})^2 = \sum(f_i u_i^2) - \bar{u}\sum(f_i u_i)$$
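
The arbitrary-origin method of (16) and (17) can be compared with a direct calculation in Python. The class values are those of the illustration above; the frequencies are hypothetical, since the text leaves that column blank, and the variable names are assumptions.

```python
import math

class_values = [18, 21, 24, 27, 30]
frequencies = [3, 8, 15, 10, 4]     # hypothetical frequencies for illustration
c = 3                               # class interval
origin = 24                         # arbitrary origin X_0

u = [(x - origin) // c for x in class_values]    # coded values -2, -1, 0, 1, 2
n = sum(frequencies)
sum_fu = sum(f * ui for f, ui in zip(frequencies, u))
sum_fu2 = sum(f * ui * ui for f, ui in zip(frequencies, u))

u_bar = sum_fu / n
mean_coded = c * u_bar + origin                                # formula (16)
s_coded = c * math.sqrt((sum_fu2 - u_bar * sum_fu) / (n - 1))  # formula (17)

# Direct calculation on the true class values, for comparison.
mean_direct = sum(f * x for f, x in zip(frequencies, class_values)) / n
ss_direct = sum(f * (x - mean_direct) ** 2 for f, x in zip(frequencies, class_values))
s_direct = math.sqrt(ss_direct / (n - 1))

print(f"coded:  mean = {mean_coded:.3f}, s = {s_coded:.4f}")
print(f"direct: mean = {mean_direct:.3f}, s = {s_direct:.4f}")
```

The two routes agree; the coded route merely keeps the numbers small, which was the object when the arithmetic was done by hand.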


12. Coefficient of variability. The term coefficient of variability is 
applied to the standard deviation when it is expressed in percentage of 
the mean of the sample. It is a statistic of limited application, owing to 
the difficulty of determining its reliability by statistical methods. The 
formula is 

$$C \text{ (Coefficient of variability)} = \frac{100\,s}{\bar{X}} \tag{18}$$

13. Exercises. 

1. Substitute the natural numbers 1, 2, 3, ..., 16 for the class values of Table 2-2 
and calculate the mean and standard deviation. Convert calculated to actual values 
by means of formulas 14 and 15. 

$\bar{X}'' = 7.852$. $\bar{X} = 7.902$. $s'' = 2.578$. $s = 2.578$. 

2. Table 2-3 gives the yields in grams of 400-square-yard plots of barley. Set up 
a frequency table and histogram for these yields. Make the class interval 11 and the 
first class 14 to 24. 

Calculate the mean and standard deviation from your frequency table. 

$\bar{X}'' = 13.06$. $\bar{X} = 151.69$. $s'' = 2.862$. $s = 31.49$. 

3. Prove the identity: 

$$\sum(x^2) = \sum(X^2) - \frac{(\sum X)^2}{N}$$

TABLE 2-3 
YIELDS IN GRAMS OF 400-SQUARE-YARD PLOTS OF BARLEY 


185 162 136 157 141 130 129 176 171 190 157 147 176 126 175 134 169 189 180 128 
169 205 129 117 144 125 165 170 153 186 164 123 165 203 156 182 164 176 176 150 
216 154 184 203 166 155 215 190 164 204 194 148 162 146 174 185 171 181 158 147 
165 157 180 165 127 186 133 170 134 177 109 169 128 152 165 139 146 144 178 188 
133 128 161 160 167 156 125 162 128 103 116 87 123 143 130 119 141 174 157 168 
195 180 158 139 139 168 145 166 118 171 143 132 126 171 176 115 165 147 186 157 
187 174 172 191 155 169 139 144 130 146 159 164 160 122 175 156 119 135 116 134 
157 182 209 136 153 160 142 179 125 149 171 186 196 175 189 214 169 166 164 195 
189 108 118 149 178 171 151 192 127 148 158 174 191 134 188 248 164 206 185 192 
147 178 189 141 173 187 167 128 139 152 167 131 203 231 214 177 161 194 141 161 
124 130 112 122 192 155 196 179 166 156 131 179 201 122 207 189 164 131 211 172 
170 140 156 199 181 181 150 184 154 200 187 169 155 107 143 145 190 176 162 123 
189 194 146 22 160 107 70 84 112 162 124 156 138 101 138 141 143 135 163 183 
99 118 150 151 83 136 171 191 155 164 98 136 115 168 130 111 136 129 122 120 
179 172 192 171 151 142 193 174 146 180 140 137 138 194 109 120 124 126 126 147 
115 148 195 154 149 139 163 118 126 127 139 174 167 175 179 172 174 167 142 169 
122 163 144 147 123 160 137 161 122 101 158 103 119 164 112 57 94 106 132 122 
164 142 155 147 115 143 68 184 183 167 160 138 191 133 160 156 122 111 153 148 
103 131 180 142 191 175 146 181 111 110 154 176 168 175 175 146 148 167 106 123 
121 154 148 91 93 74 113 79 131 119 96 80 97 98 106 107 69 86 94 129 


CHAPTER 3 

Theoretical Frequency Distributions 


1. Characteristics of one actual distribution of a biological variate. 
In Chapter 2 was shown the histogram obtained when the results were 
graphed for the frequency table for 500 bull sperms. Now, if 10,000 
sperms had been measured instead of 500, the class interval could have 
been made much smaller, and, keeping the base line the same length, the 
ends of the columns would present the appearance almost of a smooth 
curve. Carrying this reasoning to its logical conclusion, we decide that 
the areas of bull sperms in an infinitely large population could be repre- 
sented by such a curve. This is actually the picture we have in mind in 
applying statistical methods to any population of a continuous variate. 
Discontinuous variables can be represented in an infinite population by 
a histogram only.* 

* A continuous variate x is one that theoretically can take any value whatever in a 
given range a < x < b, which may be finite or infinite. The yield of a plot of wheat, 
for example, could be 160.1 grams, and, if we measured it very accurately it might be 
160.0912 grams. In theory any number of decimal places can be taken, and consequently 
there is no logical reason why the yield of the plot cannot take any value within 
the range from zero to some positive value. In other words, the yield is a variate 
that can vary continuously. 

An example of a discontinuous variate would be the number of black balls in the 
selection of 10 balls from an urn, or the number of heads with brown chaff in 100 heads 
selected at random from a plot of wheat. Such variates can obviously take only 
certain fixed values and are correctly referred to as discontinuous. 

Having noted the manner in which at least one biological variable is 
distributed, it is of interest to inquire as to the theoretical conditions from 
which such a distribution would arise. This leads to a consideration of 
the mode of action of the causes of variation. The developing sperm is 
clearly influenced by a number of factors that may affect its ultimate size. 
As it is developing from the meiotic tissue of the glands there are innumerable 
minor factors that can influence size, some acting to make the sperm 
larger and others to make it smaller. As a basis for arriving at a clue to 
the manner in which these factors act, it is useful to assume that they all 
act with equal intensity, and then to come to some conclusion concerning 
the theoretical distribution that will be generated. The procedure is one 
of mathematical derivation from the assumption that there are a large 
number of factors acting with equal intensity. It leads to the distribution 
that is commonly referred to as the normal distribution. The approach 
here is purely mathematical, and the biological student will probably be 
more interested in setting up and determining the type of distribution 
generated by actual trial. 

2. Experimental derivation of a normal distribution. On examining a 
set of random numbers such as those given by Tippett [7] and Fisher 
and Yates [2] we note that, if taken in groups of 10, the totals for the 
groups will vary from 0 to 90. A zero total can be obtained, however, 
only when we select 10 zeros, and a total of 90 only when we select 10 
nines. Since the probability of selecting a zero in taking 1 digit at random 
from the table is 0.1, the probability of selecting 10 zeros in taking a group 
of 10 at random is $0.1^{10}$ = 0.000 000 000 1. The probability of selecting 
10 nines will, of course, be the same as that of selecting 10 zeros. It is 
obvious, however, that a total such as 45 can be obtained from a large 
number of combinations of digits, and therefore in selecting groups of 
10 digits at random this total will occur with much greater frequency than 
a 0 or a 90. Thus the numbers in the middle of the range will occur with 
greatest frequency and the others with lesser frequency as we approach 
the limits. 

FIGURE 3-1. Histogram and fitted normal curve for the totals 
of 1000 groups of random digits. (Horizontal axis: class values; 
vertical axis: frequency.) 
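
The counting argument can be made exact with a little computation. The Python sketch below is an illustration, not the sampling experiment described in the text: instead of drawing groups from a table of random digits, it counts in how many ways each total from 0 to 90 can arise from 10 digits. The function name `total_counts` is an assumption.

```python
def total_counts(n_digits):
    """Number of ordered selections of n_digits decimal digits giving each possible total."""
    counts = [1]                        # with no digits, the only total is 0
    for _ in range(n_digits):
        new = [0] * (len(counts) + 9)   # one more digit extends the largest total by 9
        for total, ways in enumerate(counts):
            for digit in range(10):
                new[total + digit] += ways
        counts = new
    return counts

counts = total_counts(10)
n_groups = 10 ** 10                     # all equally likely groups of 10 digits

print("ways to total 0: ", counts[0])
print("ways to total 45:", counts[45])
print("ways to total 90:", counts[90])
print("probability of a total of 0:", counts[0] / n_groups)
```

The counts rise symmetrically from a single way at each extreme to a maximum at 45, which is the shape the histogram of actual totals approximates.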

The conditions set up here are comparable in a rough way to those 
affecting biological variates. Each digit may be regarded as a cause, and 
the total of a group as the ultimate effect of the operation of these causes 
on the biological variate. Thus, provided our assumptions are reason- 
able, we should get very much the same type of distribution from the 
totals of the groups of random numbers as we get from measuring 
biological variates. 

Figure 3-1 gives the results in the form of a histogram for the frequency 
distribution of the totals of 1000 groups of 10 random numbers each, 
taken from the tables by Fisher and Yates. A smooth normal curve has 
been fitted by the method described below in Example 3-1. The histo- 
gram is sufficiently like Figure 2-1 to support the conclusion that the two 
sets of conditions are roughly comparable. 

We are now in a position to discuss some of the elementary mathe- 
matical principles of the normal curve and to define some of the statistics 
involved. 

3. The normal distribution. The normal distribution is defined by 

$$Y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2} \tag{1}$$

where $\sigma$ is the standard deviation of the population, N is the total number 
of variates, e is the base of the Napierian system of logarithms, and Y is 
the frequency density at the point x, where x is measured from the mean 
of the population. The curve expresses, therefore, the relation between 
Y and x, with Y as the dependent variable. Figure 3-2 shows a normal 
curve superimposed on the histogram of Figure 2-1 for the frequencies 
of areas of bull sperms. 

Equation (1) may be written 

$$Y = \frac{N}{\sigma} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x/\sigma)^2} \tag{2}$$

and, putting Z for $Y(\sigma/N)$, we have 

$$Z = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x/\sigma)^2} \tag{3}$$

Now for values of $x/\sigma$ < −6 or > 6 the values of Z are extremely small; 
therefore in practice we can obtain all the information we require about 
Z if it is tabulated for $x/\sigma$ from 0 to 6, proceeding by intervals of 0.01. 
These are given in Sheppard's Tables of the Probability Integral and are 
published in Pearson's Tables for Statisticians and Biometricians, Part 1 
(in abbreviated form in Table A-2). For the actual population with 
which we are dealing, values of Z can then be transformed to Y by multiplying 
by $cN/\sigma$, where c is the class interval. In other words, for a 
particular population for which N and $\sigma$ are known, we can proceed with 
a set of tables of Z to plot the theoretical smooth normal curve. 

FIGURE 3-2. Normal curve superimposed on histogram of Figure 2-1. 
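
Where printed tables of Z are not at hand, the ordinates can be computed directly. The following Python sketch evaluates Z for a given value of x/σ and scales it by cN/σ; the function names `z_ordinate` and `fitted_ordinate` are hypothetical, and the figures N = 400 and a standard deviation of 2.45 are borrowed from Example 3-1 purely for illustration.

```python
import math

def z_ordinate(t):
    """Ordinate Z of the unit normal curve at t = x / sigma."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def fitted_ordinate(t, n, sd, class_interval=1.0):
    """Ordinate Y on the scale of the histogram: Z multiplied by cN / sigma."""
    return z_ordinate(t) * class_interval * n / sd

for t in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(f"x/sigma = {t:3.1f}   Z = {z_ordinate(t):.4f}   "
          f"Y = {fitted_ordinate(t, n=400, sd=2.45):6.2f}")
```

Since the curve is symmetrical, the ordinate at −x/σ is the same as that at +x/σ, which is why the tables need cover only positive values.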

A smooth curve plotted by the above method is an estimate of the form 
of the infinite population from which the sample was drawn; but what is 
often required is the theoretical frequency distribution corresponding to 
the actual frequency distribution of the sample. That is, we require the 
theoretical normal frequencies for the arbitrarily chosen class values of 
the actual distribution. For this purpose, if N is taken as 1, equation (1) 
becomes 

$$Y = \frac{e^{-\frac{1}{2}(x/\sigma)^2}}{\sigma\sqrt{2\pi}} \tag{4}$$

which can be integrated from $x/\sigma = -\infty$ to $x/\sigma$ = any assigned value. 
This gives the area under that portion of the curve, which is represented 
usually by ½(1 + α) in Sheppard's Tables of the Probability Integral. 
The integration is started at $x/\sigma = -\infty$ because the normal curve never 
actually touches the base line, although, at $x/\sigma = -6$, Y is an exceedingly 
small value. The reason for indicating the area by ½(1 + α) or ½ + ½α 
will be seen from an examination of Figure 3-3. For any assigned value 
of $x/\sigma$ the area within the limits of ± $x/\sigma$ is represented by α. Therefore, 
if the total area of the curve is 1, the area from $x/\sigma = -\infty$ to $x/\sigma$ = any 
assigned value is ½ + ½α. The application of this method to a practical 
example is given below under Section 4. 

FIGURE 3-3. Normal curve showing ordinates at x/σ = +1 and 
−1. The unshaded area is α, and the shaded area is (1 − α). 
(Label in figure: Area = ½(1 + α).) 
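
The tabulated quantity ½(1 + α) can also be obtained from the error function, which is available in the Python standard library as `math.erf`. The sketch below relies on the standard relation that the area of the unit normal curve within ± t is erf(t/√2); the function names are assumptions introduced for this illustration.

```python
import math

def alpha(t):
    """Area of the unit normal curve between -t and +t, where t = x / sigma."""
    return math.erf(t / math.sqrt(2.0))

def half_one_plus_alpha(t):
    """Area from minus infinity up to t, i.e. 1/2 (1 + alpha) for t >= 0."""
    return 0.5 * (1.0 + alpha(t))

for t in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(f"x/sigma = {t:3.1f}   alpha = {alpha(t):.4f}   "
          f"1/2(1 + alpha) = {half_one_plus_alpha(t):.4f}")
```

At x/σ = 1 the central area is roughly two-thirds of the whole, which is the situation pictured in Figure 3-3.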

4. Fitting a normal curve. Fitting a normal curve is demonstrated by 
means of Example 3-1, below. 

Example 3-1. The calculations necessary to fit a normal curve to an 
actual frequency distribution and to determine the normal frequencies 
corresponding to the actual frequencies are given in Table 3-1. The data 
are for the transparencies of 400 red blood cells taken from a patient 
suffering from primary anemia [4]. The transparency is taken as the 
ratio of the total light passing through the cell to the area of the cell in 
cross section. For this distribution $\bar{X} = 7.06$ and $s'' = 2.45$; the latter 
figure is to take the place of $\sigma$. 

The calculations can best be described by considering each column of 
the table. These have been numbered at the head of the table for 
convenient reference. 

Column 1. The class ranges are as described in Chapter 2. Note that 
unit class intervals have been substituted for the actual class intervals in 
order to simplify the arithmetic. After the class ranges are set up, the 
actual frequencies may be entered as in column 10, but they do not enter 
into the calculations at this stage. 

Column 2. In order to be sure of the meaning of the term class limits,* 
refer to any histogram as in Figure 2-1, Chapter 2. The limits correspond 
with the lines bordering the columns of the histogram. 

The mean of the sample is placed in the range in which it falls. In this 
case the mean is 7.06 and must be placed opposite the class range 6.6 to 
7.5. The remaining limits are then entered by passing in both directions 
from the mean. The class in which the mean falls will require the entering 
of two limits, but for each of the others we enter only the one farthest 
from the mean. 

Column 3. The deviation of the class limit from the mean. 

Column 4. Values in the same line in the previous column, divided by 
the standard deviation which is calculated from the formula 

$$\sqrt{\frac{\sum(X - \bar{X})^2}{N}}$$

where the deviations $(X - \bar{X})$ are based on the unit class interval. 

Column 5. Values of Z from Table A-2 of ordinates of the normal 
curve. 

Column 6. Corresponding Z values multiplied by $N/s''$. The complete 
formula is $cN/s''$, but c is equal to 1. 

Column 7. Values of ½(1 + α) from Table A-1. 

Column 8. Corresponding values of ½(1 + α) multiplied by N. 

Column 9. Differences between consecutive values in column 8, 
beginning at the ends and working towards the center. In the central 
class the two differences are added. This column must total to N, 
providing a check on the work. 

Column 10. The actual frequencies. 

* Note that these class limits are not exact; e.g., between the class ranges 0.6 to
1.5, and 1.6 to 2.5, the exact class limit will be 1.55. Omitting the 0.05 in each class 
simplifies the work and makes a negligible change in the calculated ordinates and 
frequencies. 
TABLE 3-1

CALCULATION OF ORDINATES FOR FITTING A NORMAL CURVE,
AND THEORETICAL FREQUENCIES

    (1)       (2)     (3)     (4)      (5)      (6)       (7)        (8)         (9)        (10)
                    Deviation                 Ordinate                        Theoretical
   Class     Class  from Mean                  Y =                              Normal      Actual
   Range     Limit     x      x/s'      Z      ZN/s'   ½(1 + α)   ½N(1 + α)   Frequency   Frequency

                      9.56    3.90   0.0002    0.03    1.0000     400.00        0.08
                      8.56    3.49   0.0009    0.15    0.9998     399.92        0.32
                      7.56    3.08   0.0035    0.57    0.9990     399.60        1.08
  0.6- 1.5    0.5     6.56    2.68   0.0110    1.80    0.9963     398.52        3.16          4
  1.6- 2.5    1.5     5.56    2.27   0.0303    4.95    0.9884     395.36        7.92         11
  2.6- 3.5    2.5     4.56    1.86   0.0707   11.54    0.9686     387.44       16.84         17
  3.6- 4.5    3.5     3.56    1.45   0.1394   22.76    0.9265     370.60       30.28         29
  4.6- 5.5    4.5     2.56    1.04   0.2323   37.92    0.8508     340.32       44.76         43
  5.6- 6.5    5.5     1.56    0.64   0.3251   53.08    0.7389     295.56       59.16         56
              6.5     0.56    0.23   0.3885   63.43    0.5910     236.40
  6.6- 7.5    7.06    0.00    0.00   0.3989   65.12    0.5000     200.00       64.96         58
              7.5     0.44    0.18   0.3925   64.08    0.5714     228.56
  7.6- 8.5    8.5     1.44    0.59   0.3352   54.72    0.7224     288.96       60.40         63
  8.6- 9.5    9.5     2.44    1.00   0.2420   39.51    0.8413     336.52       47.56         61
  9.6-10.5   10.5     3.44    1.40   0.1497   24.44    0.9192     367.68       31.16         25
 10.6-11.5   11.5     4.44    1.81   0.0775   12.65    0.9649     385.96       18.28         20
 11.6-12.5   12.5     5.44    2.22   0.0339    5.53    0.9868     394.72        8.76          9
 12.6-13.5   13.5     6.44    2.63   0.0126    2.06    0.9957     398.28        3.56          4
 13.6-14.5   14.5     7.44    3.04   0.0039    0.64    0.9988     399.52        1.24
                      8.44    3.44   0.0011    0.18    0.9997     399.88        0.36
                      9.44    3.85   0.0002    0.03    0.9999     399.96        0.08
                     10.44    4.26   0.0000    0.00    1.0000     400.00        0.04

   Total                                                                      400           400
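The column-by-column procedure of Table 3-1 can be sketched in Python. This is only an illustration under stated assumptions: it uses the exact normal density and cumulative function in place of Tables A-1 and A-2, it takes s' = 2.45 (a value inferred from the ratios of columns 3 and 4, not quoted in the text), and the function names are hypothetical.

```python
import math

N = 400          # total frequency in Table 3-1
MEAN = 7.06      # sample mean in unit class intervals
S_PRIME = 2.45   # assumption: inferred from columns 3 and 4 of the table


def ordinate(x):
    """Column 6: height of the fitted curve at x, i.e. Z * N / s' with c = 1."""
    t = (x - MEAN) / S_PRIME                                 # column 4
    z = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)    # column 5
    return z * N / S_PRIME


def cumulative(x):
    """N times the area of the normal curve lying below the class limit x."""
    return N * 0.5 * (1.0 + math.erf((x - MEAN) / (S_PRIME * math.sqrt(2.0))))


def theoretical_frequency(lower, upper):
    """Column 9: expected frequency between two class limits."""
    return cumulative(upper) - cumulative(lower)


# Class limits -2.5, -1.5, ..., 17.5 cover every line of the table.
limits = [x + 0.5 for x in range(-3, 18)]
freqs = [theoretical_frequency(a, b) for a, b in zip(limits, limits[1:])]
```

Because the table works from z values rounded to two decimals, individual frequencies from this sketch differ slightly from column 9, but the total still checks against N.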


5. Types of abnormality. The types of variation from the normal of
frequency distributions may be divided roughly into three classes. These
are:

a. Skewness. The degree of skewness of a distribution is indicated
approximately by

$$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\sigma}$$

where the mode is the position on the base line or X ordinate of a perpen- 
dicular line drawn to the maximum point of the curve. This measure is 
obviously zero for the normal distribution, as the curve is symmetrical 
and the mean and the mode coincide. When the mode is greater than the 
mean, we have negative skewness, and, when less than the mean, positive 
skewness. For negative skewness note that the tail of the curve is extended 
to the left, and, for positive skewness, to the right. 

b. Flatness. In flat curves the shoulders are filled out more fully than 
in the normal curve and the tails are depleted. 
c. Peakedness. For peaked curves the center is higher and more
pointed than the normal and the tails are extended. 

In certain distributions we may have skewness as well as flatness or 
peakedness, as indicated by b and c in Figure 3-4, which illustrates some 
common types of abnormality. 

6. Measures of abnormality. An approximate measure of skewness
can be obtained by the method described in Section 5, but a much more
efficient measure is obtained by calculating statistics that are determined
by the values of the moments of the sample. The kth moment of a
distribution with respect to a particular origin m is given by
$[\sum (X - m)^k]/N$. Thus the second moment about the mean as origin
will be $[\sum (X - \bar{X})^2]/N$.

[Figure 3-4. Illustrating types of abnormality. Panels: positive skewness, negative skewness, peaked, flat-topped.]

The moments up to the fourth power are convenient indications of 
deviations from normality. For example, if the distribution is symmetri- 
cal, the moments about the mean will be zero for the odd powers, and, if 
the tails are extended, this will tend to make the even high-power moments 
larger than for a normal distribution. 

For the calculation of the moments about the mean as origin it is most
convenient to obtain first the series of quantities indicated below.

$$a_1 = \frac{\sum(X)}{N}, \qquad a_2 = \frac{\sum(X^2)}{N}, \qquad a_3 = \frac{\sum(X^3)}{N}, \qquad a_4 = \frac{\sum(X^4)}{N}$$

Then the moments are

First moment     $\nu_1 = a_1$
Second moment    $\nu_2 = a_2 - a_1^2$
Third moment     $\nu_3 = a_3 - 3a_1a_2 + 2a_1^3$                    (6)
Fourth moment    $\nu_4 = a_4 - 4a_1a_3 + 6a_1^2a_2 - 3a_1^4$

Two measures of distribution type based on the moments are

$$\alpha_3 = \frac{\nu_3}{\nu_2\sqrt{\nu_2}}, \qquad \alpha_4 = \frac{\nu_4}{\nu_2^2} \qquad (7)$$

(These are frequently referred to as $\sqrt{\beta_1}$ and $\beta_2$.) Since the odd power
moments are zero for a symmetrical distribution, it follows that $\alpha_3$ will
be zero for the normal or any other symmetrical distribution. For the
normal curve $\alpha_4 = 3$, and any deviations from this value are regarded as
indications of abnormality. If $\alpha_4 < 3$ the distribution is flat-topped, and
if $\alpha_4 > 3$ it is peaked.
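The moment calculations can be sketched in Python using the data of Table 3-2 (Example 3-2). This is a minimal illustration rather than the book's desk-calculator routine. The function names are hypothetical, and the frequency of 30 for the class X = 5 is an inference from N = 113 and the column totals.

```python
# Class values and frequencies of Table 3-2.
# Assumption: f = 30 for X = 5, inferred from N = 113 and the column sums.
X = [1, 2, 3, 4, 5, 6, 7, 8, 9]
f = [1, 6, 13, 25, 30, 22, 9, 5, 2]
ORIGIN = 5  # arbitrary origin (assumed mean) used in the example


def raw_moments(values, freqs, origin):
    """a1..a4: moments about the arbitrary origin, X replaced by d = X - origin."""
    n = sum(freqs)
    return [sum(fi * (xi - origin) ** k for xi, fi in zip(values, freqs)) / n
            for k in (1, 2, 3, 4)]


def central_moments(a1, a2, a3, a4):
    """Second, third and fourth moments about the mean, as in formula (6)."""
    v2 = a2 - a1 ** 2
    v3 = a3 - 3 * a1 * a2 + 2 * a1 ** 3
    v4 = a4 - 4 * a1 * a3 + 6 * a1 ** 2 * a2 - 3 * a1 ** 4
    return v2, v3, v4


a1, a2, a3, a4 = raw_moments(X, f, ORIGIN)
v2, v3, v4 = central_moments(a1, a2, a3, a4)
alpha3 = v3 / v2 ** 1.5   # zero for a symmetrical distribution
alpha4 = v4 / v2 ** 2     # 3 for the normal curve
```

Working from an arbitrary origin does not change the moments about the mean; it only keeps the powers of d small.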

For the measurement of the departure of frequency distributions from
normality, R. A. Fisher [1] calculates a set of values related to the
moments that he refers to as the k statistics. These are given by

$$k_1 = \nu_1$$

$$k_2 = \frac{N}{N-1}\,\nu_2$$

$$k_3 = \frac{N^2}{(N-1)(N-2)}\,\nu_3 \qquad (8)$$

$$k_4 = \frac{N^2}{(N-1)(N-2)} \cdot \frac{(N+1)\nu_4 - 3(N-1)\nu_2^2}{N-3}$$

Two of the k statistics, $k_2$ and $k_4$, require correction for the interval of
grouping of the frequency distribution. For a unit interval the corrected
values are given by

$$k'_2 = k_2 - \tfrac{1}{12}, \qquad k'_4 = k_4 + \tfrac{1}{120} \qquad (9)$$

Corrections for other intervals will not be necessary as it is always possible
to use a unit interval for purposes of calculation.
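As a rough check on the k statistics and their grouping corrections, the sketch below applies them to the moments listed in Table 3-2. It assumes the usual forms of Fisher's expressions; the function names are hypothetical and the input figures are simply copied from the table.

```python
def k_statistics(n, v1, v2, v3, v4):
    """Fisher's k statistics from the moments about the mean."""
    k1 = v1
    k2 = n / (n - 1) * v2
    k3 = n ** 2 / ((n - 1) * (n - 2)) * v3
    k4 = (n ** 2 / ((n - 1) * (n - 2) * (n - 3))
          * ((n + 1) * v4 - 3 * (n - 1) * v2 ** 2))
    return k1, k2, k3, k4


def grouping_corrections(k2, k4):
    """Corrections for a unit class interval: k2 - 1/12 and k4 + 1/120."""
    return k2 - 1 / 12, k4 + 1 / 120


# Moments from Table 3-2 (N = 113); v1 is measured from the assumed mean.
k1, k2, k3, k4 = k_statistics(113, -0.0884956, 2.48775, 0.6788565, 18.3359)
k2_corrected, k4_corrected = grouping_corrections(k2, k4)
```

Note that k4 is a small difference between two large products, so it is sensitive to rounding in the fourth moment.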
For the k statistics, the measures of curve type comparable to $\alpha_3$ and
$\alpha_4$ are

$$g_1 = \frac{k_3}{k_2\sqrt{k_2}}, \qquad g_2 = \frac{k_4}{k_2^2} \qquad (10)$$

These are considered to be estimates from the samples studied of the
population parameters $\gamma_1$ and $\gamma_2$.

The standard errors of $g_1$ and $g_2$ are

$$s_{g_1} = \sqrt{\frac{6N(N-1)}{(N-2)(N+1)(N+3)}} \qquad (11)$$

$$s_{g_2} = \sqrt{\frac{24N(N-1)^2}{(N-3)(N-2)(N+3)(N+5)}} \qquad (12)$$

The student should note that $g_2$ tends to a value of 0 and $\alpha_4$ tends to a
value of 3, as the curve approaches the normal.

For normal distributions both $g_1$ and $g_2$ are zero. $g_1$ is a measure of
symmetry and has the same sign as (mean - mode). Figure 3-4 illustrates
positive and negative skewness as indicated by positive and negative
values of $g_1$. A positive value of $g_2$ indicates a peaked curve, and a
negative value a flat-topped curve. These two types are also illustrated
in Fig. 3-4.

The calculation of the k statistics and the measures of skewness, $g_1$ and
$g_2$, is illustrated in Example 3-2.
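A short Python sketch of the two measures and their standard errors follows. It assumes the standard forms of Fisher's expressions and uses the corrected k statistics of Table 3-2 as input; the function names are hypothetical.

```python
import math


def g_statistics(k2, k3, k4):
    """g1 (skewness) and g2 (peakedness) from the k statistics."""
    return k3 / k2 ** 1.5, k4 / k2 ** 2


def standard_errors(n):
    """Standard errors of g1 and g2 for a sample of n observations."""
    se_g1 = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_g2 = math.sqrt(24 * n * (n - 1) ** 2
                      / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))
    return se_g1, se_g2


# Corrected values k'2, k3, k'4 from Table 3-2, with N = 113.
g1, g2 = g_statistics(2.42663, 0.6972583, 0.1094)
se_g1, se_g2 = standard_errors(113)
```

The standard errors depend only on N, so they can be worked out before the data are examined.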

Example 3-2. Table 3-2 gives the frequency distribution for this
example and the preliminary calculations. Note that the deviations are
taken from an arbitrary origin of 5. It is not essential that an arbitrary
origin be employed, but it makes possible a decided reduction in the work,
especially when the number of classes in the frequency table is fairly
large. In the formulas given above for $a_1$, $a_2$, $a_3$, and $a_4$, X is replaced by
d, the deviation from the assumed mean.

From the k statistics, $g_1$ and $g_2$ are calculated as follows.

$$g_1 = \frac{0.697\,258}{(2.426\,63)^{3/2}} = 0.184, \qquad s_{g_1} = 0.227$$

$$g_2 = \frac{0.1094}{(2.426\,63)^2} = 0.0186, \qquad s_{g_2} = 0.451$$



TABLE 3-2

CALCULATION OF k STATISTICS USING AN ASSUMED MEAN

                 Deviation (d)
                 from Assumed
   X       f         Mean            fd            fd²            fd³           fd⁴
   1       1          -4             -4             16            -64           256
   2       6          -3            -18             54           -162           486
   3      13          -2            -26             52           -104           208
   4      25          -1            -25             25            -25            25
   5      30           0              0              0              0             0
   6      22           1             22             22             22            22
   7       9           2             18             36             72           144
   8       5           3             15             45            135           405
   9       2           4              8             32            128           512

   Sums (N = 113)                   -10            282              2          2058
   a1, a2, a3, a4            -0.088 495 6       2.495 58    0.017 699 1       18.2124
                                               -0.007 83    0.662 543 5        0.0063
                                                           -0.001 386 1        0.1173
                                                                              -0.0001
   ν2, ν3, ν4*                                  2.487 75    0.678 856 5       18.3359
   k2, k3, k4                                   2.509 96    0.697 258 3        0.1011
   Corrections                                 -0.083 33                       0.0083
   k'2, k3, k'4                                 2.426 63    0.697 258 3        0.1094

* To determine figures for $\nu_2$, $\nu_3$, and $\nu_4$ in Table 3-2, refer to formula (6).

7. Distribution of means of samples. In Section 6 it was noted that
$\alpha_3$ and $\alpha_4$ are measures of the degree of abnormality in actual frequency
distributions. In connection with these statistics there is a well-known
theorem of mathematical statistics that has an important bearing on the
distribution of the means of samples. This theorem states that, if we
draw samples of size N from any population and determine the frequency
distribution of the means of the samples, the values $\alpha_{3\bar{x}}$ and $\alpha_{4\bar{x}}$ for this
distribution will be given by

$$\alpha_{3\bar{x}} = \frac{\alpha_3}{\sqrt{N}}, \qquad \alpha_{4\bar{x}} = \frac{\alpha_4 - 3}{N} + 3$$

where $\alpha_3$ and $\alpha_4$ are values for the population sampled. Thus, regardless
of any abnormalities that may exist in the parent population, the
distribution of the means of samples drawn from that population will tend
towards normality.* It can be shown that, even for populations having
definitely skewed distributions, the means of samples for which n is as
small as 4 are for all practical purposes normally distributed. Shewhart
[5] performed the experiment of drawing 1000 samples of 4 each from 2
abnormal populations. One of these was rectangular, and the other one
right-triangular. In both cases the distributions of means of the samples
of 4 closely approximated the normal distribution. In this connection
Shewhart states: "Such evidence, supported by more rigorous analytical
methods beyond the scope of our present discussion, leads us to believe
that in almost all cases in practice we may establish sampling limits
for averages of samples of 4 or more, upon the basis of normal law
theory."

This brings our attention back to the near normal distribution of 
Section 2 that was obtained by taking totals of samples of 10 digits drawn 
at random from a table of random numbers. Random numbers have an 
essentially rectangular distribution, that is, the top of the curve is flat, 
and, since the range of the digits is limited to 0 to 9, the theoretical values
of $\alpha_3$ and $\alpha_4$ can be deduced. These work out to $\alpha_3 = 0$ and $\alpha_4 = 2.01$.
For the distribution of totals of 10 numbers we should have therefore
$\alpha_3 = 0$ and $\alpha_4 = [(2.01 - 3)/10] + 3 = 2.90$. The value of $\alpha_4$ obtained
experimentally was 2.88.

8. The binomial distribution. The binomial distribution is one of the 
most useful distributions for dealing with discontinuous variates. It may 
be defined in practical terms as follows: In a segregating population of 
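wheat plants, if the probability of the occurrence of an awned plant is p,
and the probability of the occurrence of an awnless plant is q, then the
theoretical distribution of awnless and awned plants in a large number of
samples of size n will be given by the successive terms of the expanded
binomial $(q + p)^n$. Assuming q = 1/4 and p = 3/4, on expanding the
binomial $(1/4 + 3/4)^6$ (see Section 2d, Chapter 1), where each sample is
made up of 6 plants, we get

$$\left(\tfrac14\right)^6 + 6\left(\tfrac14\right)^5\left(\tfrac34\right) + 15\left(\tfrac14\right)^4\left(\tfrac34\right)^2 + 20\left(\tfrac14\right)^3\left(\tfrac34\right)^3 + 15\left(\tfrac14\right)^2\left(\tfrac34\right)^4 + 6\left(\tfrac14\right)\left(\tfrac34\right)^5 + \left(\tfrac34\right)^6$$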


* This is not strictly true, as there are certain theoretical distributions for which 
not all the moments may exist. These distributions are not very likely to be models 
for the actual distributions of biological variates and so are not considered to be im- 
portant in deriving the general law stated above with respect to the means of samples. 
It must be remembered, however, that occasional distributions occur that are very 
abnormal, and consequently the reduction in abnormality for means of samples may 
be quite slow as the size of the sample is increased. 
which is a little easier to calculate if it is put into the form

$$\left(\tfrac14\right)^6 + 6 \times 3\left(\tfrac14\right)^6 + 15 \times 3^2\left(\tfrac14\right)^6 + 20 \times 3^3\left(\tfrac14\right)^6 + 15 \times 3^4\left(\tfrac14\right)^6 + 6 \times 3^5\left(\tfrac14\right)^6 + 3^6\left(\tfrac14\right)^6$$

Using logarithms, the terms are calculated as follows:

Log term 1 = log (1/4)^6 = 6 log (1/4)
Log term 2 = log 6 + log 3 + 6 log (1/4)
Log term 3 = log 15 + 2 log 3 + 6 log (1/4)
etc.

It is convenient first to write down 6 log (1/4) and log 3. Here we have
(a bar over the first digit marks a negative characteristic)

6 log (1/4) = 4̄.387 640 0
log 3 = 0.477 121 3

In the next step the logs and antilogs of the terms are written as follows.

Awned Plants in                       Antilog Term or
Sample of 6         Log Term          Probability
     0              4̄.387 640 0        0.000 244
     1              3̄.642 912 6        0.004 395
     2              2̄.517 973 9        0.032 959
     3              1̄.120 033 9        0.131 836
     4              1̄.472 216 5        0.296 631
     5              1̄.551 397 8        0.355 957
     6              1̄.250 367 8        0.177 979


A direct method of calculating the terms of a binomial expansion which
is done on the machine and does not require logarithms is given by
Snedecor [6]. It is illustrated below for the expansion of $(1/4 + 3/4)^6$.

Awned      $q^{n-r}$            $p^r$
Plants     $= (0.25)^{n-r}$     $= (0.75)^r$     $\binom{n}{r}$      Term
  0        0.000 244                               1          0.000 244
  1        0.000 977         0.75                  6          0.004 396
  2        0.003 906         0.562 5              15          0.032 957
  3        0.015 625         0.421 875            20          0.131 836
  4        0.062 5           0.316 406            15          0.296 631
  5        0.25              0.237 305             6          0.355 958
  6                          0.177 979             1          0.177 979

We start with 0.25 in the last line but one and multiply successively by
0.25, writing down the product at each stage. Similarly, the powers of
0.75 are calculated successively and put in, beginning with the second line.
The fourth column is n!/[r!(n - r)!]. Multiplying the figures in the second,
third, and fourth columns gives the term value in the fifth column.
Another procedure adapted to machine work arises from the fact that,
beginning with $q^n$ and multiplying successively by (p/q), we obtain

$$q^n,\; q^{n-1}p,\; q^{n-2}p^2,\; q^{n-3}p^3,\; \ldots$$

and if these are multiplied by the corresponding value of $\binom{n}{r}$ we have
the terms of the expanded binomial. It is convenient to obtain $q^n$ either
by logarithms or by continuous multiplication on the machine. For the
same example the calculations are

Awned Plants      $q^{n-r}p^r$          $\binom{n}{r}$        Term
     0            0.000 244 141    x     1     =     0.000 24
     1            0.000 732 423    x     6     =     0.004 39
     2            0.002 197 27     x    15     =     0.032 96
     3            0.006 591 81     x    20     =     0.131 84
     4            0.019 775 4      x    15     =     0.296 631
     5            0.059 326 2      x     6     =     0.355 96
     6            0.177 979        x     1     =     0.177 98


We can now make a practical application of this distribution. If p 
and q are the correct probabilities for a given population, it is easy to 
determine the probability, in taking a random sample of 6 plants, of 
obtaining 2 awned plants or fewer. This is obviously the sum of the 
probabilities of obtaining 0, 1, and 2 awned plants, which works out to 
0.038. In taking 1000 samples of 6 plants the expected number of samples 
containing 2 awned plants or fewer is 38. 
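The same terms can be obtained in a few lines of Python. This sketch is offered only as a modern stand-in for the logarithm and machine methods above; the function names are hypothetical. The second function follows the successive-multiplication idea, starting from q^n and multiplying by p/q at each step.

```python
from math import comb


def binomial_terms(n, p):
    """Terms of (q + p)^n: probability of r = 0, 1, ..., n occurrences."""
    q = 1.0 - p
    return [comb(n, r) * q ** (n - r) * p ** r for r in range(n + 1)]


def binomial_terms_by_ratio(n, p):
    """Same terms, built from q^n by repeated multiplication by p/q."""
    q = 1.0 - p
    power = q ** n
    terms = []
    for r in range(n + 1):
        terms.append(comb(n, r) * power)
        power *= p / q
    return terms


terms = binomial_terms(6, 0.75)      # awned plants in samples of 6
p_two_or_fewer = sum(terms[:3])      # 0, 1 or 2 awned plants
```

Multiplying `p_two_or_fewer` by 1000 gives the expected number of samples, out of 1000, with 2 awned plants or fewer.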

The mean and standard deviation of a binomial distribution can be 
worked out in the same manner as for any frequency distribution. Thus, 
assuming that there are 1000 samples of 6 plants, the theoretical distribu- 
tion would be as follows. 


Awned Plants
in Sample
     X            Frequency
     0                0.2
     1                4.4
     2               33.0
     3              131.8
     4              296.6
     5              356.0
     6              178.0

   Total             1000

$$\sum(X) = (0 \times 0.2) + (1 \times 4.4) + \cdots + (6 \times 178.0) = 4500.2$$

$$\text{Mean awned plants per sample} = \frac{4500.2}{1000} = 4.50$$

$$\sum(X^2) = (0.2 \times 0^2) + (4.4 \times 1^2) + \cdots + (178.0 \times 6^2) = 21{,}376.2$$

$$\frac{[\sum(X)]^2}{1000} = \frac{(4500.2)^2}{1000} = 20{,}251.8$$

Then

$$\sum(x^2) = 21{,}376.2 - 20{,}251.8 = 1124.4$$

and

$$s = \sqrt{\frac{1124.4}{1000}} = 1.06$$

Actually in this example we are measuring two parameters of a theoretical
distribution for which p, q, and n are known. Since p and n completely
define a binomial distribution, the mean and standard deviation must be
related to them mathematically. Therefore, by mathematical derivation
it can be proved that

$$\bar{X} = pn \qquad (13)$$

where $\bar{X}$ stands for the mean number of occurrences per sample of the
event for which the probability is p. It can also be shown that the standard
deviation is given by

$$\sigma = \sqrt{pqn} \qquad (14)$$

Applying these formulas to the example above, we have

$$\bar{X} = \tfrac34 \times 6 = 4.50, \qquad \sigma = \sqrt{\tfrac34 \times \tfrac14 \times 6} = 1.06$$

As would be expected, the agreement with the direct calculations is 
perfect. 
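The agreement between the direct calculation and the two formulas can be checked with a small Python sketch. The function name is hypothetical, and exact probabilities are used in place of the frequencies per 1000 samples, so no rounding enters.

```python
from math import comb, sqrt


def binomial_mean_sd(n, p):
    """Mean and standard deviation computed directly from the terms of (q + p)^n."""
    q = 1.0 - p
    probs = [comb(n, r) * q ** (n - r) * p ** r for r in range(n + 1)]
    mean = sum(r * pr for r, pr in enumerate(probs))
    variance = sum(r * r * pr for r, pr in enumerate(probs)) - mean ** 2
    return mean, sqrt(variance)


mean, sd = binomial_mean_sd(6, 0.75)
mean_formula = 0.75 * 6                 # pn
sd_formula = sqrt(0.75 * 0.25 * 6)      # square root of pqn
```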

It will of course be understood that in certain applications the values 
of p and q are unknown, in which case the direct method as illustrated
first would enable us to obtain estimates of the values of p and q from 
the data. 

When n is small the effect of the values of p and q on the terms of the
expanded binomial is fairly clear. When p and q are equal the terms are 
symmetrical, but when p and q are not equal the distribution is skewed. 
A very good picture of the effect of p, q, and n on the shape of the
binomial distribution can be obtained from the formulas for $\alpha_3$ and $\alpha_4$
appropriate to the binomial. These are

$$\alpha_3 = \frac{q - p}{\sqrt{pqn}}, \qquad \alpha_4 = 3 + \frac{1 - 6pq}{pqn} \qquad (15)$$

We note first that $\alpha_3 = 0$ when p = q, and that its value increases as p
and q differ. However, regardless of the difference between p and q,
$\alpha_3$ becomes quite small as n increases. A study of $\alpha_4$ reveals that
variations in p, q, and n make relatively small changes in its value, but when
there is a marked difference between p and q, and n is small, there is a
tendency for the distribution to be flat-topped.
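The behaviour of the two shape measures can be explored numerically. The sketch below assumes the standard binomial expressions for skewness and kurtosis, which is how formula (15) is read here; the function name is hypothetical.

```python
from math import sqrt


def binomial_shape(n, p):
    """alpha3 and alpha4 for the binomial (q + p)^n."""
    q = 1.0 - p
    alpha3 = (q - p) / sqrt(n * p * q)
    alpha4 = 3.0 + (1.0 - 6.0 * p * q) / (n * p * q)
    return alpha3, alpha4


# The awned-plant example (p = 3/4) for increasing sample sizes.
for n in (6, 60, 600):
    a3, a4 = binomial_shape(n, 0.75)
    print(f"n = {n:3d}   alpha3 = {a3:7.4f}   alpha4 = {a4:6.4f}")
```

Running the loop shows both measures moving towards the normal values of 0 and 3 as n increases.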

If we set up a histogram for a binomial distribution where n = 9, there 
are 10 classes and 10 columns in the histogram. On the same length of 
base line, a histogram for n = 29 will have 30 columns, and the columns 
will be 1/3 the width of those for the example with n = 9. Continuing to
increase n and erecting histograms on the same base line, it is clear that 
the tops of the histograms will tend towards a continuous curve. Actu- 
ally, as n increases, the binomial distribution approaches and merges into 
the normal distribution that we have already studied. 

9. The Poisson distribution. The Poisson distribution is a theoretical 
distribution of great practical importance. It can be illustrated best by 
a specific example. 

Suppose that a carload of wheat is being sampled and that each sample 
consists of 1 pound or approximately 16,000 seeds. There are seeds of 
ragweed in the wheat, and each sample is examined carefully in order to 
determine the number of ragweed seeds per pound. If the actual number 
of ragweed seeds per pound in the entire carload is 3, then, if a number 
of samples are taken, a variation in number of ragweed seeds around 
this figure will be expected. It will be apparent from our study of the 
binomial distribution that the theoretical frequency of weed seeds per 
sample of 16,000 would be given by 


$$(q + p)^{16,000}$$


where p is the probability of the occurrence of a weed seed and q is the 
probability of non-occurrence. It should be noted, however, that in two 
respects this differs markedly from the type of problem to which the 
binomial distribution was applied. In the first place the number in the 
sample is very large, and in the second place the probability p is very
small. When we have these conditions, it can be shown mathematically
that the consecutive terms of the distribution can be expressed by the
following.

$$e^{-m}\left(1,\; m,\; \frac{m^2}{2!},\; \frac{m^3}{3!},\; \ldots,\; \frac{m^r}{r!}\right) \qquad (16)$$


where e is the base of the Napierian system of logarithms and m is the 
mean number of occurrences of the event per sample. This is commonly 
known as the Poisson distribution. In more detail it may be referred to 
as the Poisson exponential limit to the binomial. 
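How closely the Poisson terms follow the binomial in the ragweed example can be seen with a short Python comparison. This is an illustrative sketch only: the function names are hypothetical, and the figure of 16,000 seeds per pound is the approximate one given in the text.

```python
from math import comb, exp, factorial


def poisson_terms(m, r_max):
    """Poisson probabilities of 0, 1, ..., r_max occurrences for mean m."""
    return [exp(-m) * m ** r / factorial(r) for r in range(r_max + 1)]


def binomial_terms(n, p, r_max):
    """First r_max + 1 terms of (q + p)^n, counting occurrences of the event."""
    q = 1.0 - p
    return [comb(n, r) * p ** r * q ** (n - r) for r in range(r_max + 1)]


n, m = 16000, 3.0                       # seeds per sample, mean weed seeds
poisson = poisson_terms(m, 11)
binomial = binomial_terms(n, m / n, 11)
largest_gap = max(abs(a - b) for a, b in zip(poisson, binomial))
```

With n this large and p this small, the two sets of terms agree to about four decimal places.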

One of the most useful properties of the Poisson distribution is that 
the variance is equal to the mean. For this reason a Poisson distribution 
is completely specified by the mean, and consequently in the expression 
above for the terms of the distribution the mean is the only parameter 
given. 

It is of interest to note also the effect of p and n on the shape of the
Poisson distribution. Since

$$\alpha_3 = \frac{1}{\sqrt{np}} \quad \text{and} \quad \alpha_4 = 3 + \frac{1}{np} \qquad (17)$$

it is obvious that as p decreases we have greater skewness, and as n
increases the distribution approaches symmetry. The formula for $\alpha_4$
shows that its value will never be less than 3; therefore the distribution 
never becomes flatter than the normal. As the quantity np becomes 
appreciable, the degree of peakedness of the curve becomes very small. 
Examples to which the Poisson distribution may be applied can in most 
cases be determined by the nature of the sampling problem. In the 
example given, the probability of obtaining a weed seed in selecting any 
one seed at random from the carlot is quite small, but the actual number 
of weed seeds in a 1-pound sample is appreciable. Since the mean is
given by pn where n is the number in the sample, the conditions under 
which a Poisson distribution should be obtained can be defined as follows. 
The probability of the occurrence of the event must be quite small, but n 
must be large enough so that pn is an appreciable quantity. For example, 
suppose that there are approximately 10,000 heads of wheat per plot in 
a series of 200 plots. On examining the plots it is noted that a small 
proportion of heads are killed by wheat stem maggots. Counts show 
that the mean number of injured heads per plot is 6. Since pn = 6, we 
can determine that p = 6/10,000, and it is immediately obvious that the
distribution of injured heads will follow the Poisson distribution. 
Applications of the Poisson distribution are important in many sampling 
procedures. Further applications are given in a later chapter, but it will 
be of interest at this point to consider the application of the Poisson 
distribution to seed sampling. Suppose that a farmer has a quantity of
seed which has 3 weed seeds to the pound. What is the probability that
a 1-pound sample of seed taken from the bulk lot will contain 6 or more
weed seeds, 5 per pound being the tolerance limit? This, in other words,
is the probability that the farmer's seed will be rejected on the basis of
the results from the examination of the 1-pound sample. In order to
solve this problem we must set up the theoretical distribution for the case 
in which m = 3. This has been done for us and is available in Pearson's 
Tables for Statisticians and Biometricians. On looking up the table and 
entering under m = 3, the following values are found. 


Number of Weed Seeds     Probability of
     per Sample            Occurrence
          0                  0.050
          1                  0.149
          2                  0.224
          3                  0.224
          4                  0.168
          5                  0.101
          6                  0.050
          7                  0.022
          8                  0.008
          9                  0.003
         10                  0.001
         11                  0.000

        Total                1.000


Letting P = probability of acceptance of a sample and Q = 1 - P
= probability of rejection, we note that Q is the sum of the probabilities
of obtaining 6, 7, 8, or more weed seeds per sample. This is 0.084, and
hence P = 0.916. Thus, if 5 is the tolerance limit for seed of a certain
class and a single 1-pound sample is taken, the probability that the lot
will be rejected is 0.084. Another way of stating this proposition is to 
say that the probability of rejection is about 1 : 12; or, if a large number 
of such samples is presented for inspection, the expectation is that only 
1 out of 12 will be rejected. 
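In place of the published tables, the rejection probability can be computed directly. The following Python sketch does so; the function names are hypothetical, and the tolerance of 5 weed seeds and the mean of 3 per pound are those of the example.

```python
from math import exp, factorial


def poisson_probability(r, m):
    """Probability of exactly r occurrences when the mean is m."""
    return exp(-m) * m ** r / factorial(r)


def probability_of_rejection(m, tolerance):
    """Q = 1 - P, where P is the chance of finding no more than `tolerance` weed seeds."""
    accept = sum(poisson_probability(r, m) for r in range(tolerance + 1))
    return 1.0 - accept


q_reject = probability_of_rejection(3.0, 5)
p_accept = 1.0 - q_reject
```

Changing the first argument shows how quickly the risk of rejection rises as the true number of weed seeds per pound approaches the tolerance limit.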


10. Exercises.


1. Calculate the ordinates (Y) and the theoretical normal frequencies for the fre- 
quency distribution from the exercises of Chapter 2. Totaling the theoretical frequencies 
will provide a check on the work. 

2. Make a histogram for the actual frequencies of Exercise 1, and plot over this 
graph the smooth curve arising from connecting the end points of the ordinates. 
3. Put equation (1) of Section 3 equal to

$$Y = Ce^{-\frac{1}{2}x^2/\sigma^2}$$

and show how $\sigma$ affects the shape of the curve.
4. If the mean of a population is 21.65 and $\sigma$ = 3.21, find the probability that a
variate taken at random will be greater than 28.55 or less than 14.75. P = 0.0316.
5. If, for the population of Exercise 4, the standard deviation of the mean of a
sample of 400 variates is $\sigma/\sqrt{400}$, find the probability that the mean of a sample taken
at random will fall outside the limits 21.33 to 21.97. P = 0.045.
6. Test the following distributions for departure from normality. 


(a) xX 1 E ОЗУ ОЕ СИ 12, 14-49 16 
Ж 1 57 185 217 177 126 87 54 30 20 14 П 127,512 
() Х 17 АА OSS 1 Ж8 7192010: 711712; 13214 15 16 
f 21 4s ЛІП 20750794 21-10) 1. 5772 2 
(c) X 15272 4 24667 ТВ 19-10, 11-12. 18: 14 15 16 
if 1 7 13 19 23 26 27 28 26 24 22 17 14/7 9- ӘРІ 


(a) $g_1$ = 1.360, $g_2$ = 2.143. (b) $g_1$ = -0.327, $g_2$ = 0.939.
(c) $g_1$ = 0.107, $g_2$ = -0.766.


7. Suppose that you have a bag of beans of which 20% are black and the remainder 
white. Determine the probability in drawing a single sample of 8 beans that it will
contain 6 or more white beans. P = 0.7969. 

8. A sample of 1000 seeds of wheat is drawn from a supply and grown in order to 
observe the proportion of off-type plants. If the supply contains exactly 3 off-types 
per 1000 plants, determine the probability that 3 off-types will be found in the test. 

9. Table 3-3 gives the frequency table for 1000 totals of 10 random digits. Calculate
the mean and standard deviation, taking an assumed mean at 45.
$\bar{X}$ = 45.32. s = 9.148.


TABLE 3-3

FREQUENCY TABLE FOR 1000 TOTALS OF 10 RANDOM DIGITS

Class Range    Frequency    Class Range    Frequency
   14-16            2          44-46          137
   17-19            1          47-49          119
   20-22            3          50-52          104
   23-25            9          53-55           86
   26-28           16          56-58           64
   29-31           37          59-61           40
   32-34           56          62-64           18
   35-37           77          65-67            9
   38-40           86          68-70            5
   41-43          130          71-73            1
CHAPTER 4


Tests of Significance 


One of the most important functions of the statistical method is to 
provide “tests of significance.” The quoted phrase may express a con- 
crete and definite idea on first reading, but on further thought it becomes 
necessary to clarify its meaning and particularly to define what is meant 
by a test of significance in the statistical sense. The word significance 
has a meaning in every-day usage that in only a broad sense is similar to 
its meaning in the statistical phrase. 

Suppose that we have a coin and that there is some question as to 
whether or not it is unbiased, i.e., whether or not there is an equal proba- 
bility on throwing the coin of its turning up heads or tails. The first 
thought that suggests itself is to throw the coin a number of times and 
note what happens. If we carry out a trial and there is a preponderance 
of heads or tails, we have some evidence to the effect that the coin is
biased. However, without further analysis this experiment leaves us in
a very unsatisfactory state of mind because we do not know what pre- 
ponderance of heads or tails indicates definitely that the coin is biased, 
nor do we know what variation from the expected equality of heads and 
tails can be expected even if the coin is true. It is at this point that we 
draw on our knowledge of mathematics and at the same time adopt a 
logical procedure that keeps our thinking clear. 

1. Logic of tests of significance. Suppose that a coin thrown 16 times 
gives 13 heads and 3 tails. The first process of developing a logical test 
of significance is to ask the question: How frequently will such a result 
be obtained if the coin is unbiased? With a true coin the probability of 
heads (p) equals the probability of tails (q), where p + q = 1. The
distribution of heads and tails should be given therefore by the expansion
of the binomial $(p + q)^{16} = (1/2 + 1/2)^{16}$. Expanding to get the first 5
terms, we have

Number of heads        16           15           14           13           12
Probability (P)    0.000 015    0.000 244    0.001 831    0.008 545    0.027 771

and the probability of getting 13 or more heads in throwing a true coin
16 times is the sum of the first four terms,

0.000 015 + 0.000 244 + 0.001 831 + 0.008 545 = 0.011

The result observed would occur with a true coin only about once in 100
trials, where each trial represents 16 throws. That it should occur the 
first time in an actual throw would seem to cast considerable doubt on 
the trueness of the coin. 
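The tail probability used in this argument can be reproduced with a short Python sketch. The function name is hypothetical; the calculation simply sums the terms of the expanded binomial from the observed number of heads upwards.

```python
from math import comb


def upper_tail(n, k, p=0.5):
    """Probability of k or more heads in n throws when P(heads) = p."""
    q = 1.0 - p
    return sum(comb(n, r) * p ** r * q ** (n - r) for r in range(k, n + 1))


p_value = upper_tail(16, 13)   # 13 or more heads in 16 throws of a true coin
```

The same function gives the probability for any other observed count, for example `upper_tail(16, 12)` for 12 or more heads.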

It should now be clear that we have at our disposal, from our knowledge 
of the theoretical distribution of heads and tails for a true coin, a definite 
measuring stick with which to judge the significance of results with an 
actual coin. If the theoretical distribution gives P = 0.4, it is obvious 
that the result obtained is not in any way unusual and that there is no 
evidence of bias. On the other hand, if P = 0.001, we would be reason- 
ably certain that the coin is not true. 

This brings us to a consideration of values of P. At what level shall 
we say a result is significant? In order to answer this question the most 
realistic approach is to consider the results obtained with a large number 
of trials with unbiased coins. Let us decide that if P < 0.05 we shall 
consider the coin biased, but if P > 0.05 we shall classify it as unbiased. 
Now in 1000 trials with a true coin it is expected that 50 will give values 
of P < 0.05. Thus in 5% of the trials we shall classify the coin as biased 
when it is actually unbiased. If we are prepared to be wrong in 5% of 
the trials a level of P equal to 0.05 is satisfactory, but if we do not wish 
to be wrong as often as this it is necessary to use a lower value. 

At this point we should not fall into the error of deciding that a very 
low value of P is the most desirable one, on the assumption that it will 
reduce errors of judgment. The reason for this is that one important 
factor has not yet been taken into account. In actual practice we are 
very unlikely to be testing a large series of perfect coins. In order that 
there shall be some point in making the tests it must be assumed that there 
are at least some bad coins present, and the object of the tests is to 
eliminate them. It should be obvious then that there are two kinds of 
errors that can be made. Owing to the pertinent suggestion of J. Neyman 
and E. S. Pearson [7], these are referred to as errors of the first kind and 
errors of the second kind. We make an error of the first kind when a 
good coin is classified as bad, and an error of the second kind when a 
bad coin is classified as good. Now let us see how the significance level 
affects the proportion of the two kinds of errors. 

A simple way of demonstrating the effect of using different levels of 
significance is to assume that the coins being tested consist of true coins 
for which p = q = 1/2, and bad coins for which p = 0.8 and q = 0.2.
Figure 4-1 shows the theoretical distribution of heads for the two types
of coins. Let us assume arbitrarily that the level of significance is taken
at P = 0.038, representing the trial wherein 12 or more heads are obtained
in 16 throws. The distributions for $(0.5 + 0.5)^{16}$ and $(0.8 + 0.2)^{16}$ are
shown in Figure 4-1. The single-hatched area of the distribution for
good coins representing 3.8% of the total is shown at the lower left of
the figure. This represents the percentage of good coins that are called
bad. The double-hatched area represents the percentage of bad coins
that are called good.

[Figure 4-1. Distribution of heads in 16 throws of 2 types of coins. Panels: good coins (p = q = 1/2) and bad coins; horizontal axis: heads, 16 to 0; a vertical line marks the level of significance.]

The effect of changing the level of significance will
now be obvious. If we move it to the right to include one more column,
only 8.2% of the bad coins will be considered good but 10.5% of the
good coins will be considered bad. Moving the line one column to the
left gives 40.2% of the bad coins classified as good and 1.1% of the good
coins classified as bad. This can be summarized very conveniently as
follows.

                          Percentage of Good    Percentage of Bad
Level of Significance,    Coins Classified      Coins Classified      Total
Heads per 16 Throws       as Bad                as Good               Error
   15: P = 0.0003              0.03                 85.9               85.9
   14: P = 0.002               0.2                  64.8               65.0
   13: P = 0.011               1.1                  40.2               41.3
   12: P = 0.038               3.8                  20.2               24.0
   11: P = 0.105              10.5                   8.2               18.7
   10: P = 0.227              22.7                   2.7               25.4
    9: P = 0.402              40.2                   0.7               40.9
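The percentages in this summary can be recomputed with a brief Python sketch. The function names are hypothetical; the good coin has p = 0.5 and the bad coin p = 0.8, as assumed in the text, and a coin is called bad when the number of heads reaches the chosen level.

```python
from math import comb


def tail_at_least(n, k, p):
    """Probability of k or more heads in n throws when P(heads) = p."""
    q = 1.0 - p
    return sum(comb(n, r) * p ** r * q ** (n - r) for r in range(k, n + 1))


def error_rates(n, threshold, p_good=0.5, p_bad=0.8):
    """Percentages of errors of the first and second kind for a given level."""
    first_kind = tail_at_least(n, threshold, p_good)          # good coin called bad
    second_kind = 1.0 - tail_at_least(n, threshold, p_bad)    # bad coin called good
    return 100 * first_kind, 100 * second_kind


for heads in range(15, 8, -1):
    e1, e2 = error_rates(16, heads)
    print(f"{heads:2d}   {e1:6.2f}   {e2:6.2f}   {e1 + e2:6.2f}")
```

Calling `error_rates` with a different `p_bad`, say 0.7, shows how the point of minimum total error shifts with the degree of bias in the bad coins.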
For this particular example there is a point of minimum error at
P = 0.105, but this does not tell the whole story. The result would be
different if the bad coins were represented by a distribution of $(0.7 + 0.3)^{16}$
and would be a very complex affair if there were a wide range of bad coins 
differing in amount of bias. Furthermore there is the question of the 
seriousness of the two kinds of error. In certain cases an error of the 
first kind may be more serious than an error of the second kind, and 
hence we would require a lower value of P, for our level of significance. 
When errors of the second kind are more important, we would require a 
higher value of P. 
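
The two columns of errors tabulated above follow directly from the binomial expansions for the two types of coins. The following is a minimal sketch of that arithmetic, assuming p = 0.5 for good coins and p = 0.8 for bad coins as in the text; the function names `upper_tail` and `error_rates` are illustrative only.

```python
from math import comb

def upper_tail(n, k, p):
    """P(X >= k) for X distributed as Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def error_rates(n, k, p_good=0.5, p_bad=0.8):
    """Percentages of the two kinds of error when k or more heads means 'bad'.

    Returns (good coins classified as bad, bad coins classified as good).
    """
    first_kind = 100 * upper_tail(n, k, p_good)
    second_kind = 100 * (1 - upper_tail(n, k, p_bad))
    return first_kind, second_kind

# One line per level of significance, 15 heads down to 9 heads.
for k in range(15, 8, -1):
    a, b = error_rates(16, k)
    print(f"{k:2d} heads: good as bad {a:6.2f}%  bad as good {b:5.1f}%  total {a + b:5.1f}%")
```

Changing `p_bad` to 0.7 shows how the point of minimum total error shifts when the bad coins are less strongly biased.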

Serious thought on this subject leads to conservatism in setting levels 
of significance. In any event it is clear that setting too low a value of P 
will increase errors of the second kind. This is a most important general- 
ization and should not be lost sight of in any applications of tests of 
significance. As a general working rule a value of P equal to 0.05 is 
considered satisfactory. It is usually referred to as the 5% level of 
significance. 

2. A test for bias in either direction. In the preceding discussion of 
testing coins the question asked was: “Is the coin biased such that heads 
turn up more frequently than is expected?” In other words we did not 
ask the simpler question: “Is the coin biased?” The latter would involve 
testing for a bias towards the production of either heads or tails. To 
answer this question it is necessary to determine the frequency for a good 
coin of obtaining a result as bad or worse than that observed. Thus, if 
we get 13 heads and 3 tails, the result at the other end of the scale repre- 
senting a similar degree of bias would be 3 heads and 13 tails. The 
frequency of results as bad or worse than the one observed would be given 
by adding the theoretical frequencies at both ends of the distribution. In 
this example, since the distribution is symmetrical, it is simply a matter 
of doubling the frequency for 13 or more heads, so we get P = 0.022. 
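
As a rough illustration of the two-tailed calculation, the sketch below sums the binomial frequencies for a good coin at both ends of the distribution. The function name `two_sided_p` is an assumption made for this example. Working from the unrounded frequencies gives a value near 0.021, whereas doubling the rounded figure 0.011 gives the 0.022 quoted in the text.

```python
from math import comb

def two_sided_p(n, heads):
    """Probability, for a fair coin, of a result as bad or worse in either direction."""
    extreme = max(heads, n - heads)            # fold the lower tail onto the upper tail
    upper = sum(comb(n, i) for i in range(extreme, n + 1)) / 2**n
    return min(1.0, 2 * upper)                 # symmetry: double one tail, cap at 1

print(round(two_sided_p(16, 13), 4))
```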

It will be recognized that the simpler question with respect to the 
existence or non-existence of a bias is the more general one, and therefore 
in tests of significance it is usually the type of question asked. There is 
no difficulty involved here in deciding how to get the correct value of P 
if one is clear as to the question being asked. Lack of understanding on 
this point has, however, caused a good deal of confusion. 

3. Tests of significance involving the distribution of t. In experimental 
procedure, the making of surveys, etc., the tests of significance to be made 
are considerably more complicated from the theoretical standpoint than 
the one we have just described in connection with the testing of coins. 
The actual performance of these tests is not difficult, however, owing to 
a great deal of the theoretical work having been done in advance and 
tables prepared for the purpose. At this stage in our study of statistics 
we are concerned chiefly with the underlying logic of the test and shall 
approach it from that standpoint. 

Suppose that the results given in Table 4-1 of the yields in plot tests of 
2 varieties of wheat are brought into the statistical laboratory and we 
are asked to examine the data and come to some conclusion as to whether 
or not the 2 varieties are essentially different in yielding ability. 


TABLE 4-1 


YIELDS OF 2 WHEAT VARIETIES AT 12 STATIONS 


Station            1     2     3     4     5     6     7     8     9    10    11    12 
Variety $X_1$   24.4  27.9  28.2  19.8  23.1  22.9  25.6  28.7  26.2  25.7  37.0  31.5 
Variety $X_2$   17.5  15.1  21.6  18.2  21.6  13.7  24.8  27.8  25.2  19.2  34.0  25.2 


In the first place we know from previous study that the yields of wheat 
varieties are most likely to be normally distributed. Furthermore an 
examination of the literature on the distribution of the yields of grain 
from field plots can be made and further evidence obtained that there is a 
definite tendency towards normality in the distribution of a variate of 
this type. In any event we know that the means of samples of 12 variates 
will be approximately normally distributed, regardless of the kind of 
distribution of the population from which the sample is drawn. It can 
be assumed safely therefore that, if we obtain a large number of samples 
of 12 yields each of either variety $X_1$ or variety $X_2$, the means of these 
samples will be normally distributed. 

The logical hypothesis to set up for testing is that the mean difference 
between the 2 varieties is zero. We picture therefore, arising from our 
hypothesis, a very large normal distribution of the differences $\bar{X}_1 - \bar{X}_2$ 
about a mean of zero. Then from our knowledge of the normal distribution 
(Chapter 3, Section 3) we know that, having the mean ($\mu$) and the 
standard deviation of this distribution ($\sigma_{md}$), where the subscripts indicate 
that it is a standard deviation of a mean difference, we can determine the 
probability of obtaining a value of $\bar{X}_1 - \bar{X}_2$ as great or greater than the 
one observed. For example, if the standard deviation $\sigma_{md} = 2$, and 
$\bar{X}_1 - \bar{X}_2 = 4$, the ratio of these two quantities, which is merely a measure 
of $\bar{X}_1 - \bar{X}_2$ in terms of the standard deviation, is equal to 2, and tables 
of the probability integrals of the normal curve give for this ratio 
$\tfrac{1}{2}(1 + \alpha) = 0.9773$. Therefore the probability of obtaining a positive 
value of $\bar{X}_1 - \bar{X}_2$ as great or greater than the one observed is 
1 - 0.9773 = 0.0227. 
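
Where tables of the normal probability integral are not at hand, the same tail area can be sketched with the error function. This is only an illustration under the assumption that the standard deviation of the mean difference is known; the names `normal_cdf` and `upper_tail_probability` are ours, not the text's.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Area under the standard normal curve from minus infinity to z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def upper_tail_probability(mean_difference, sigma_md):
    """Probability of a positive difference as great or greater than the one observed."""
    ratio = mean_difference / sigma_md     # the difference in units of its standard deviation
    return 1.0 - normal_cdf(ratio)

# The example in the text: a difference of 4 with a standard deviation of 2.
print(upper_tail_probability(4.0, 2.0))
```

The printed value agrees with the tabular figure to the accuracy of four-place tables.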

When we apply this method to an actual example, however, it is obvious 
that we do not have the pertinent facts for making the test outlined. It 
will be remembered that our mental picture arising from the hypothesis 
is a normally distributed population about a mean of zero with a given 
standard deviation $\sigma$. It would be necessary to know this standard 
deviation in order to make the test work perfectly. In other words, 
knowing the standard deviation, we could draw a large number of samples, 
each one giving us a value of $\bar{X}_1 - \bar{X}_2$. We could then find the ratio 
$(\bar{X}_1 - \bar{X}_2)/\sigma_{md}$ for each sample and by means of the tables find a value of 
P which would be the probability of obtaining a value of $\bar{X}_1 - \bar{X}_2$ as 
great or greater than the one observed. On checking through the results 
for a large series of samples for which the hypothesis is true, we should 
find that we had obtained values of P < 0.05 for 5% of the samples. 
This would be exactly in accord with theory and would prove that the 
test is correct. However, we do not know $\sigma_{md}$, and any test made must 
be based on an estimate of $\sigma_{md}$ determined from the sample. 

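The first problem is to decide on a method of estimating $\sigma_{md}$, and this 
involves estimating $\sigma$ for the population of individual values from which 
the samples are drawn and equating this to an estimate of $\sigma_{md}$. A logical 
decision is to pool the estimates of $\sigma$ obtained from the 2 samples. 

Representing estimates of $\sigma$ by s, we have, from sample 1, 

$$s_1^2 = \frac{\sum (X_1 - \bar{X}_1)^2}{n - 1} = \frac{\sum x_1^2}{n - 1} \qquad (1)$$

and for sample 2 

$$s_2^2 = \frac{\sum (X_2 - \bar{X}_2)^2}{n - 1} = \frac{\sum x_2^2}{n - 1} \qquad (2)$$

Now the estimate of the standard deviation of a difference is obtained from 

$$s_d^2 = \frac{\sum [(X_1 - X_2) - (\bar{X}_1 - \bar{X}_2)]^2}{n - 1} = \frac{\sum [(X_1 - \bar{X}_1) - (X_2 - \bar{X}_2)]^2}{n - 1} = \frac{\sum (x_1 - x_2)^2}{n - 1} \qquad (3)$$

On expanding the last expression we have 

$$s_d^2 = \frac{\sum x_1^2}{n - 1} + \frac{\sum x_2^2}{n - 1} - \frac{2\sum (x_1 x_2)}{n - 1} \qquad (4)$$

$$\phantom{s_d^2} = s_1^2 + s_2^2 - \frac{2\sum (x_1 x_2)}{n - 1} \qquad (5)$$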


and we see that $s_d^2$ is equal to the sum of the variances for $X_1$ and $X_2$ 
less a quantity depending on $\sum (x_1 x_2)$, the sum of the products of the 
deviations from the mean of the variates in the 2 samples. More information 
on this quantity will be given in our study of regression and 
correlation, but for the present it is sufficient to point out that, as we take 
pairs of values of $x_1$ and $x_2$, if they tend to be both positive or both 
negative, the sum of products will be a positive quantity. On the other 
hand, if positive values of one variable are usually associated with negative 
values of the other variable, the sum of products will be negative. There 
is the intermediate situation where about half of the positive values of $x_1$ 
will be obtained with negative values of $x_2$, and the other half with positive 
values of $x_2$. The negative values of $x_1$ will be similarly divided, and the 
resulting sum of products will be close to zero. The latter situation is 
described by saying that the 2 variables are independent. 

In our example the yields come from pairs of plots at a series of stations, 
and if we examine the deviations of each yield from the mean of the 
variety it is noted that deviations of the same sign tend to occur 
together. This means that the sum of products $\sum (x_1 x_2)$ will be a definite 
positive quantity and therefore should be taken into account in equation (5). 
We can calculate $s_d^2$ from equation (4), but it is more convenient to 
make use of the relation 

$$\sum (x_1 - x_2)^2 = \sum (X_1 - X_2)^2 - \frac{(\sum X_1 - \sum X_2)^2}{n} \qquad (6)$$

in order to determine the numerator for the calculation of $s_d^2$ from 
equation (3). 
The differences $X_1 - X_2$ are 

6.9  12.8  6.6  1.6  1.5  9.2  0.8  0.9  1.0  6.5  3.0  6.3 

and their total or $\sum X_1 - \sum X_2 = 57.1$. We then have 

$$\sum (x_1 - x_2)^2 = 437.85 - \frac{57.1^2}{12} = 166.15$$

and 

$$s_d^2 = \frac{166.15}{11} = 15.105$$

Also 

$$s_{md}^2 = \frac{s_d^2}{n} = \frac{15.105}{12} = 1.2588$$

And 

$$s_{md} = \sqrt{1.2588} = 1.12$$


Having made an estimate of $s_{md}$ we can calculate the ratio $(\bar{X}_1 - \bar{X}_2)/s_{md}$. 
This is known as the statistic t. The question we then have to ask is 
whether or not this ratio can replace $(\bar{X}_1 - \bar{X}_2)/\sigma_{md}$ for the probability 
calculations. This is easily answered. $\sigma_{md}$ is a population parameter 
and is therefore constant from sample to sample. On the other hand, 
$s_{md}$ is estimated from the sample and will vary. It is a statistic and is 
subject to the usual sampling variations and has a distribution all its 
own. We note particularly that if n is large the values of $s_{md}$ will be 
quite consistent, whereas if n is small the variations in $s_{md}$ will be 
proportionately greater. This would indicate that if n is large the error 
involved in replacing $\sigma_{md}$ with $s_{md}$ will be negligible, but if n is small the 
error may be appreciable. Actually what we require for a solution of 
this problem is the sampling distribution of t for different values of n. 
If we knew this, we could determine for any given problem the probability 
of obtaining, when the hypothesis is true, a value of t as great or greater 
than the one observed. 
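
The effect of replacing the population standard deviation by a sample estimate can be seen in a small sampling experiment. The sketch below is an illustration only: it assumes normally distributed differences with a true mean of zero, draws repeated samples of n, and counts how often the ratio of the mean to its estimated standard error falls outside the normal-curve limits of 1.96. The function name `exceed_fraction`, the number of trials, and the seed are arbitrary choices.

```python
import random
from math import sqrt

def exceed_fraction(n, trials=5000, limit=1.96, seed=1):
    """Fraction of samples of size n whose ratio mean / s_md exceeds `limit` in absolute value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        d = [rng.gauss(0.0, 1.0) for _ in range(n)]           # hypothesis true: mean zero
        mean = sum(d) / n
        s2 = sum((x - mean) ** 2 for x in d) / (n - 1)        # sample estimate of variance
        t = mean / sqrt(s2 / n)                               # ratio using the estimate
        if abs(t) > limit:
            hits += 1
    return hits / trials

for n in (4, 12, 60):
    print(n, exceed_fraction(n))
```

With the true standard deviation the fraction would be about 5% for every n. With the estimate it is noticeably larger for small n and approaches 5% as n increases, which is the behavior described above.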

This problem was first solved by “Student” [9] in 1908, and he published 
tables of the distribution of z, which is a statistic closely related to t and 
from which tests of significance could be made. Fisher [2] gave the 
distribution of t and supplied a formal proof which covered the distribution 
of z. 

The first point to be noted in connection with tables of probability 
based on the distribution of t is that in theory there is a different 
distribution for each value of n - 1, the degrees of freedom available for 
estimating $\sigma^2$. To cover the entire field a separate table would be required 
for each value of n - 1, where n - 1 could take any value from 1 to $\infty$. 
This is clearly impossible, but on examining the distribution we note that 
beyond n - 1 = 30 there is very little change and from that point on it is 
satisfactory to use a table of t based on n - 1 = $\infty$. 

In order to cover the whole field, however, 30 tables would be required, 
each as large as the tables of the probability integral for the normal curve. 
This would also be an enormous task. It has been abbreviated by 
Fisher [3] and Fisher and Yates [5] through the preparation of tables 
giving values of t through the range of n - 1 = 1 to 30, and then for 
40, 60, 120, and $\infty$, for given values of P. The values of P are those in 
which we are chiefly interested in making tests of significance, such as 
0.01, 0.05, and 0.10. 

Returning now to the numerical example, we find that 

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{md}} = \frac{4.76}{1.12} = 4.25$$

On looking up Table A-3, we observe that, for n - 1 = 11, the value of 
t for P = 0.01 is 3.11, and since our t is greater than this we know that 
P is less than 0.01. There are definite indications that the 2 varieties are 
really different in yielding ability. 

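Summarizing the procedure for applying the t test to the mean difference 
of a series of paired variates, we have 

$$s_d^2 = \frac{\sum (x_1 - x_2)^2}{n - 1} \qquad (7)$$

where 

$$\sum (x_1 - x_2)^2 = \sum (X_1 - X_2)^2 - \frac{(\sum X_1 - \sum X_2)^2}{n} \qquad (8)$$

$$s_{md}^2 = \frac{s_d^2}{n} \qquad (9)$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{md}} = (\bar{X}_1 - \bar{X}_2)\sqrt{\frac{n(n - 1)}{\sum (x_1 - x_2)^2}} \qquad (10)$$

The table of t is entered for n - 1 degrees of freedom. 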
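
The paired procedure can be sketched in a few lines, using the yields of Table 4-1. This is an illustrative sketch rather than a prescribed routine; the function name `paired_t` and the variable names are assumptions. Because the sketch carries full precision throughout, its t differs slightly in the second decimal from a value obtained with the rounded quantities 4.76 and 1.12.

```python
from math import sqrt

# Yields of the 2 varieties at the 12 stations (Table 4-1).
x1 = [24.4, 27.9, 28.2, 19.8, 23.1, 22.9, 25.6, 28.7, 26.2, 25.7, 37.0, 31.5]
x2 = [17.5, 15.1, 21.6, 18.2, 21.6, 13.7, 24.8, 27.8, 25.2, 19.2, 34.0, 25.2]

def paired_t(a, b):
    """Return (t, degrees of freedom, s_md) for paired variates a and b."""
    n = len(a)
    d = [u - v for u, v in zip(a, b)]                      # differences within pairs
    ss = sum(v * v for v in d) - sum(d) ** 2 / n           # sum of squared deviations of differences
    s_d2 = ss / (n - 1)                                    # variance of a difference
    s_md = sqrt(s_d2 / n)                                  # standard error of the mean difference
    t = (sum(d) / n) / s_md
    return t, n - 1, s_md

t, df, s_md = paired_t(x1, x2)
print(f"t = {t:.2f} on {df} degrees of freedom, s_md = {s_md:.2f}")
```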

4. Application of the t test when the variates are not paired. The 
example given above was for a series of pairs of plots, and in working 
out a method of estimating $s_{md}^2$ we noted that it was necessary to take 
into consideration the value of $\sum (x_1 x_2)$ which enters into the equation. 
When the variates are not paired, this term can be disregarded because 
it follows that, if the variates $X_1$ and $X_2$ are drawn independently, the 
positive and negative products $x_1 x_2$ will cancel out. We have then the 
simple equation 

$$s_d^2 = s_1^2 + s_2^2 \qquad (11)$$

There are two variations of this case: (1) where the numbers of variates 
in the 2 samples are equal, or $n_1 = n_2$; and (2) where $n_1 \neq n_2$. For 
these variations the formulas for estimating $s_{md}^2$ are slightly different. 

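(1) Variates not paired, $n_1 = n_2 = n$. 

$$s^2 = \frac{\sum x_1^2 + \sum x_2^2}{2(n - 1)} \qquad (12)$$

$$s_{md}^2 = s^2 \left(\frac{2}{n}\right) \qquad (13)$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{md}} = (\bar{X}_1 - \bar{X}_2)\sqrt{\frac{n(n - 1)}{\sum x_1^2 + \sum x_2^2}} \qquad (14)$$

And for calculation we use 

$$\sum x_1^2 = \sum X_1^2 - \frac{(\sum X_1)^2}{n} \quad \text{and} \quad \sum x_2^2 = \sum X_2^2 - \frac{(\sum X_2)^2}{n} \qquad (15)$$

In entering tables of t, note that the degrees of freedom here are 2(n - 1). 

(2) Variates not paired, $n_1 \neq n_2$. 

$$s^2 = \frac{\sum x_1^2 + \sum x_2^2}{n_1 + n_2 - 2} \qquad (16)$$

$$s_{md}^2 = s^2 \left(\frac{n_1 + n_2}{n_1 n_2}\right) \qquad (17)$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{md}} = (\bar{X}_1 - \bar{X}_2)\sqrt{\frac{n_1 n_2 (n_1 + n_2 - 2)}{(n_1 + n_2)(\sum x_1^2 + \sum x_2^2)}} \qquad (18)$$

In entering tables of t, the degrees of freedom are $n_1 + n_2 - 2$. 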

The test applied to sets of variates that are not paired is based on the 
assumption that the samples arise from populations having the same 
variance and that the differences observed between $s_1^2$ and $s_2^2$ are due to 
random sampling only. Putting this another way, what we actually wish 
to test is the significance of the difference between the means, but what 
we are actually testing is the significance of t, the ratio of the mean 
difference to its standard error as estimated from the pooled variance. 
This test can apply strictly 
to the mean difference only if the standard deviations can be assumed to 
be the same in the populations sampled. A method of dealing with the 
situation where the standard deviations can be assumed different is to 
apply what is known as the Behrens-Fisher test, tables for which are 
given by Fisher and Yates [5]. In the great majority of examples it is 
somewhat simpler to apply approximate tests as indicated below. 

a. When $n_1 = n_2 = n$, it is usually sufficient to assume that the variates 
are paired, in which case only n - 1 degrees of freedom are available for 
estimating t. The value of t required for significance is increased 
accordingly. 

b. When $n_1 \neq n_2$, an approximation suggested by Cochran and Cox [1] 
is quite satisfactory. We calculate 

$$\frac{s_1^2}{n_1} = w_1 \qquad (19)$$

$$\frac{s_2^2}{n_2} = w_2 \qquad (20)$$

Then, if $\bar{X}_1 - \bar{X}_2 = d$ is the mean difference, 

$$t = \frac{d}{\sqrt{w_1 + w_2}} \qquad (21)$$

And the t value required for significance at a given level is given by 

$$t' = \frac{t_1 w_1 + t_2 w_2}{w_1 + w_2} \qquad (22)$$

where $t_1$ is the 5% point value for $n_1 - 1$ degrees of freedom and $t_2$ for 
$n_2 - 1$ degrees of freedom. This method is demonstrated in Example 4-2. 

5. Example 4-1. Application of the t test when variates are not paired. 

The data of Section 3 can serve to illustrate the methods of calculation. 
First we have 

$$\sum x_1^2 = 24.4^2 + 27.9^2 + \cdots + 31.5^2 - \frac{321.0^2}{12} = 8806.30 - 8586.75 = 219.55$$

$$\sum x_2^2 = 17.5^2 + 15.1^2 + \cdots + 25.2^2 - \frac{263.9^2}{12} = 6168.91 - 5803.60 = 365.31$$

$$s^2 = \frac{219.55 + 365.31}{22} = 26.5845$$

$$s_{md}^2 = 26.5845 \left(\frac{2}{12}\right) = 4.4308$$

$$t = \frac{26.75 - 21.99}{\sqrt{4.4308}} = \frac{4.76}{2.105} = 2.26$$

Here we have 22 degrees of freedom for the estimation of t; therefore we 
look up t for 22 degrees of freedom. We find the 5% point to be 2.07 and 
the 1% point to be 2.82, and hence according to this test the difference 
observed is significant at the 5% level only, whereas the paired test of 
Section 3 gave a value of t well beyond the 1% point. The difference between 
the two tests is due to our not taking into consideration, in calculating 
$s_d^2$, the third term on the right of equation (4). This has increased $s_d$ to 
the point where t no longer reaches the 1% level of significance. The result 
illustrates the value of correct design and 
analysis of an experiment. In the first place, the pairing of comparisons 
wherever possible should be adopted as a method of reducing error. In 
the second place, when comparisons have been paired, it is essential that 
the correct method of analysis be followed. 
A further point is that, when pairing of comparisons is not expected to 
reduce the error, there is a loss of precision from pairing in that fewer 
degrees of freedom are available for estimating t and a larger value of t 
is required for a given degree of significance. 
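
For comparison with the paired analysis, the unpaired calculation on the same yields can be sketched as follows. The function name `pooled_t` is an assumption for this illustration. Note that the divisor of the mean difference is the standard error, that is, the square root of the variance of the mean difference, not the variance itself; with these data the sketch gives a t of about 2.26 on 22 degrees of freedom.

```python
from math import sqrt

# Yields of the 2 varieties at the 12 stations (Table 4-1), treated as unpaired.
x1 = [24.4, 27.9, 28.2, 19.8, 23.1, 22.9, 25.6, 28.7, 26.2, 25.7, 37.0, 31.5]
x2 = [17.5, 15.1, 21.6, 18.2, 21.6, 13.7, 24.8, 27.8, 25.2, 19.2, 34.0, 25.2]

def pooled_t(a, b):
    """Return (t, degrees of freedom) for two independent samples with a pooled variance."""
    n1, n2 = len(a), len(b)
    ss1 = sum(v * v for v in a) - sum(a) ** 2 / n1      # sum of squares about the mean, sample 1
    ss2 = sum(v * v for v in b) - sum(b) ** 2 / n2      # sum of squares about the mean, sample 2
    s2 = (ss1 + ss2) / (n1 + n2 - 2)                    # pooled variance
    s_md = sqrt(s2 * (n1 + n2) / (n1 * n2))             # standard error of the mean difference
    t = (sum(a) / n1 - sum(b) / n2) / s_md
    return t, n1 + n2 - 2

t, df = pooled_t(x1, x2)
print(f"t = {t:.2f} on {df} degrees of freedom")
```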


6. Example 4-2. Application of the t test when $n_1 \neq n_2$ and the 
variances of the 2 populations are assumed different. The following data 
are from J. A. Anderson's Protein Survey of Western Canadian Wheat, 
1950 Crop. The figures in the first row represent 8 values at random 
from a large series of protein determinations for individual stations in 
Saskatchewan (sample 1). The second row represents a similar series of 
14 values from Manitoba (sample 2). 

Sample 1   15.1 14.3 11.5 14.5 15.4 12.5 14.6 16.6 
Sample 2   12.2 12.5 11.2 12.6 11.0 11.6 12.0 12.5 11.8 12.4 11.5 12.0 11.6 12.7 

$$\sum X_1 = 114.5 \qquad \bar{X}_1 = 14.31$$

$$\sum X_2 = 167.6 \qquad \bar{X}_2 = 11.97$$

$$d = 14.31 - 11.97 = 2.34$$

$$\sum x_1^2 = 1657.13 - \frac{114.5^2}{8} = 18.35$$

$$s_1^2 = \frac{18.35}{7} = 2.6214 \qquad w_1 = \frac{2.6214}{8} = 0.328$$

$$\sum x_2^2 = 2010.20 - \frac{167.6^2}{14} = 3.79$$

$$s_2^2 = \frac{3.79}{13} = 0.2915 \qquad w_2 = \frac{0.2915}{14} = 0.021$$

$$t = \frac{2.34}{\sqrt{0.328 + 0.021}} = 3.96$$

$$t_1 = 2.36 \qquad t_2 = 2.16$$

$$t' = \frac{2.36 \times 0.328 + 2.16 \times 0.021}{0.328 + 0.021} = 2.35$$

Since t is well beyond the approximate 5% point value of 2.35, the 
difference in protein content can be considered significant. 

In this example there is a very decided difference in the mean squares 
for the 2 samples. The ratio is 2.6214/0.2915 = 9.0. In Chapter 5, 
which deals with the analysis of variance, it is shown that a test of 
significance can be applied to this ratio. The 5% point for this ratio can be 
shown to be 2.84, and the 1% point 4.44. Therefore here we have quite 
definite evidence of a difference in the variances of the 2 populations 
sampled and full justification for the more exact test. With the ordinary 
method we would have taken the 5% point of t for 20 degrees of freedom, 
which is 2.09, so it is evident that the more exact test requires a greater 
difference between the means for a given level of significance. 
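
The approximate test for unequal variances lends itself to a short sketch. The function names `mean_square` and `approximate_t_test` are hypothetical, and the two 5% points of t (2.36 for 7 degrees of freedom and 2.16 for 13) are simply taken from the example above rather than computed.

```python
from math import sqrt

saskatchewan = [15.1, 14.3, 11.5, 14.5, 15.4, 12.5, 14.6, 16.6]
manitoba = [12.2, 12.5, 11.2, 12.6, 11.0, 11.6, 12.0, 12.5,
            11.8, 12.4, 11.5, 12.0, 11.6, 12.7]

def mean_square(x):
    """Sum of squared deviations from the mean divided by n - 1."""
    n = len(x)
    return (sum(v * v for v in x) - sum(x) ** 2 / n) / (n - 1)

def approximate_t_test(a, b, t1, t2):
    """Return (t, weighted t required for significance) without pooling the variances.

    t1 and t2 are tabular values of t for len(a) - 1 and len(b) - 1 degrees of freedom.
    """
    w1 = mean_square(a) / len(a)                 # variance of the mean, sample 1
    w2 = mean_square(b) / len(b)                 # variance of the mean, sample 2
    d = sum(a) / len(a) - sum(b) / len(b)        # mean difference
    t = d / sqrt(w1 + w2)
    t_required = (t1 * w1 + t2 * w2) / (w1 + w2)
    return t, t_required

t, t_required = approximate_t_test(saskatchewan, manitoba, t1=2.36, t2=2.16)
print(f"t = {t:.2f}, required at the 5% point = {t_required:.2f}")
```

The required value is a weighted average of the two tabular values, so it always lies between them and is pulled towards the value belonging to the sample with the larger variance of the mean.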

7. The null hypothesis. The t test of Section 3 involved our setting up 
the hypothesis that the difference between the variety yields is normally 
distributed about a mean of zero. Professor R. A. Fisher has aptly 
termed this the null hypothesis. It is a very simple concept and should 
cause no confusion. In testing a coin our null hypothesis is that the coin 
is unbiased. For an unbiased coin we can say that the distribution of 
heads is given by the expansion of the binomial $(1/2 + 1/2)^n$, where n is 
the number of throws. Thus the null hypothesis is the basis for our 
deduction of the theoretical distribution. It should be kept clearly in 
mind that the theoretical distribution has nothing to do with the actual 
distribution obtained but flows directly from the null hypothesis. 

All tests of significance require that a hypothesis be set up. The 
second consideration is that the theoretical distribution generated by the 
hypothesis be known. If the distribution is unknown, it may be possible 
to derive it by mathematical methods. The third step is to compare the 
result obtained with the theoretical frequency of such results according 
to the hypothesis and come to some conclusion as to its statistical 
significance. 

8. Fiducial limits. The null hypothesis is only one of an infinite 
number of hypotheses that can be set up. For example, in the test made 
in Section 3 we could have set up the hypothesis that the difference between 
the varieties is normally distributed around a mean of 3.00. Putting 
$\bar{X}_1 - \bar{X}_2 = d$, the null hypothesis involves calculating t from the formula 

$$t = \frac{d}{s_{md}} \qquad (23)$$

The hypothesis that the mean is 3.00 would involve calculating t from 

$$t = \frac{d - 3.00}{s_{md}}$$

In other words, the general formula for t is 

$$t = \frac{d - m}{s_{md}} \qquad (24)$$

where any value whatever can be given to m according to the hypothesis 
required.* 

The question now arises as to what use can be made of the above 
generalization. Professor R. A. Fisher has made the extremely valuable 
suggestion that, in addition to tests of the null hypothesis, further 
information can be gained by an application of formula (24) to determine 2 values 
of m such that the calculated t is equal to the t in the table at a given 
percentage point. These values of m are known as the fiducial limits. 
For example, using the data of Section 3, 

$$t = \frac{4.76 - m}{1.12}$$

giving 

$$m_1 = 4.76 + 1.12t$$

$$m_2 = 4.76 - 1.12t$$

Then, if t at the 5% point for 11 degrees of freedom is 2.20, 

$$m_1 = 4.76 + 2.20 \times 1.12 = 7.22$$

$$m_2 = 4.76 - 2.20 \times 1.12 = 2.30$$


The meaning of these fiducial limits is made clear by noting that, if we 
select for m any values equal to or greater than 7.22, the value of P 
obtained from the t test will be < 0.05. Similarly any value of m 
equal to or less than 2.30 will yield a P < 0.05. Stating this proposition 
in another way, any values of m within the range 2.30 to 7.22 will not 
differ significantly from 4.76 at the 5% level. Any values outside the 
range 2.30 to 7.22 will differ significantly from 4.76 at the 5% level. We 
have therefore reasonably good assurance that the true mean difference 
between the variety yields lies between 2.30 and 7.22. 
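
The calculation of fiducial limits amounts to one multiplication and two additions, as the following sketch shows. The function name `fiducial_limits` is an assumption, and the tabular value of t must be supplied by the user from the table for the appropriate degrees of freedom.

```python
def fiducial_limits(mean_difference, s_md, t_point):
    """Return (lower, upper) fiducial limits for a mean difference.

    t_point is the tabular t at the chosen percentage point.
    """
    half_width = t_point * s_md
    return mean_difference - half_width, mean_difference + half_width

# Data of Section 3: d = 4.76, s_md = 1.12, t at the 5% point for 11 degrees of freedom = 2.20.
lower, upper = fiducial_limits(4.76, 1.12, 2.20)
print(f"{lower:.2f} to {upper:.2f}")
```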

It is desirable in connection with interpreting the meaning of fiducial 
limits to picture drawing a large number of samples, as in testing 2 
varieties, and determining the fiducial limits from each sample. It is to 
be expected, of course, that the fiducial limits will change from sample to 
sample, but in 95% of such determinations the fiducial limits calculated 
will be expected to contain the true mean. It is of value to picture a 
vertical line representing the true mean. The fiducial limits for 1 sample 
can then be represented by a horizontal line. In plotting the limits for 
each sample, it is to be expected that, if we calculate the limits at the 5% 
point, 95% of them will cross the vertical line representing the true mean. 

* It might be more correct to say that the general formula for t is 

$$t = \frac{\bar{X} - m}{s_{\bar{x}}}$$

In (24), d is the variable corresponding to $\bar{X}$ and $s_{md}$ corresponds to $s_{\bar{x}}$, so the formulas 
are really identical. 

The student should be warned against making the statement that, with 
respect to the above problem, it has been determined that the probability 
is 0.95 that the true mean lies between 2.30 and 7.22. The true mean is 
a parameter of a population and has therefore a fixed value, and we cannot 
make any exact probability statement with respect to where the true mean 
lies. 
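
The picture of repeated sampling can be imitated by a small experiment. The sketch below is purely illustrative: the true mean, the standard deviation, the number of trials, and the seed are arbitrary assumptions, and the tabular t of 2.20 is the 5% point for 11 degrees of freedom quoted earlier. It draws many samples of 12 differences, calculates limits from each, and counts the proportion of sets of limits that contain the fixed true mean.

```python
import random
from math import sqrt

def coverage(true_mean=3.0, sigma=4.0, n=12, t_point=2.20, trials=5000, seed=7):
    """Fraction of samples whose limits at the 5% point contain the true mean."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(trials):
        d = [rng.gauss(true_mean, sigma) for _ in range(n)]
        mean = sum(d) / n
        s_md = sqrt(sum((v - mean) ** 2 for v in d) / (n - 1) / n)
        # The limits vary from sample to sample; the true mean stays fixed.
        if mean - t_point * s_md <= true_mean <= mean + t_point * s_md:
            covered += 1
    return covered / trials

print(coverage())
```

The fraction comes out close to 95%. It describes the behavior of the procedure over many samples, not the position of the true mean relative to any one pair of limits.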


9. Exercises. 


1. The figures below are for protein tests of the same variety of wheat grown in 
2 districts. The average in district 1 is 12.74, and in district 2 is 13.03. Test the 
significance of the difference between the 2 averages. Do you consider it necessary 
in this example to make the more exact test described in Sections 5 and 6? 
Protein Results 
District 1   12.6 13.4 11.9 12.8 13.0 
District 2   13.1 13.4 12.8 13.5 13.3 12.7 12.4 
t = 1.04, P = 0.3, approximately. 


2. Mitchell [6] conducted a paired feeding experiment with pigs on the relative value 
of limestone and bonemeal for bone development. The results are given in Table 
4-2 below. 


TABLE 4-2 


ASH CONTENT IN PERCENTAGE OF SCAPULAS OF PAIRS OF 
PIGS FED ON LIMESTONE AND BONEMEAL 


Pair    Limestone    Bonemeal 
1         49.2         51.5 
2         53.3         54.9 
3         50.6         52.2 
4         52.0         53.3 
5         46.8         51.6 
6         50.5         54.1 
7         52.1         54.2 
8         53.0         53.3 
Mean      50.94        53.14 


Determine the significance of the difference between the means in two ways: (1) by 
assuming that the values are paired, and (2) by assuming that the values are not paired. 
On the basis of your results, discuss the effect of pairing. 

(1) Paired: t = 4.42, P = less than 0.01. 

(2) Unpaired: t = 2.48, P = approximately 0.02. 

3. In a wheat variety test the mean difference between 2 varieties was found to be 
4.5 bushels per acre. The standard error ($s_{md}$) was 1.5 bushels per acre. Determine 
the fiducial limits at the 5% point for the mean difference. 

Note. Since a large number of tests were made, t can be taken as 1.96. 


Fiducial limits, 1.56 to 7.44. 


CHAPTER 5 


The Analysis of Variance 


1. Definition. Basically the analysis of variance is a simple arith- 
metical method of sorting out the components of variation in a given set 
of results. In order to understand this it is essential to know what is 
meant by components of variation. 

Whenever there is heterogeneity of variation more than one component 
is present. Suppose that we have 1 group of men drawn from a population 
in which the people are all of the same race and the variable studied is 
height in inches. A frequency distribution is drawn up and found to be 
approximately normal. There is good reason to assume, therefore, that 
variation in this group is approximately homogeneous. The same would 
apply to a group of women studied in the same manner. However, if 
data from the 2 groups are mixed to form a new group, a second 
component of variation is brought in, namely the difference between the means 
of the 2 groups. This difference would be large enough so that the 
frequency distribution for the combined groups would probably show 
2 peaks or modes. Carrying the analogy further, a third group might 
consist of boys from 13 to 15 years of age, and a fourth group girls of a 
similar age. When all 4 groups are combined the frequency distribution 
might appear reasonably normal, but we know that 2 components of 
variation are actually present, one representing variation within the 
groups and another between the groups. The arithmetical procedure of 
the analysis of variance enables us to sort out and evaluate the components 
of variation for such mixed populations. 

The complete analysis of variance actually performs a dual role. In 
the first place we have the sorting out and estimation of the variance 
components, and in the second place it provides for tests of significance. 
The sorting-out process is purely mechanical and can be applied in all cases, 
but the reliability of the estimates of variance so obtained is dependent to 
some extent on the manner in which the data are collected. This is a 
difficult point to explain before the details of the methods of analysis have 
been made clear, so it is deferred to a later section. The consideration 
of tests of significance, however, can be taken up concurrently with a 
description of the method of analysis and the theory on which it is built. 

2. Principles. In Chapter 2, Section 4, we stated the 
relation between the standard deviation of a population and that of 
means of samples of size n. In terms of variance this is expressed as 

$$n\sigma_{\bar{x}}^2 = \sigma^2 \qquad (1)$$

where n is the number in each sample, $\sigma_{\bar{x}}^2$ is the variance of the means, 
and $\sigma^2$ is the variance of the population from which the samples are 
drawn. From this expression a very simple but important deduction can 
be made. Having drawn a series of samples of size n we can in any event 
calculate a mean square $V_i$ for any one sample, and each of these can be 
taken as an estimate of the population variance ($\sigma^2$). Furthermore, we 
can calculate $V_{\bar{x}}$ from the means of the samples, and from equation (1) 
above it is clear that $nV_{\bar{x}}$ can also be taken as an estimate of $\sigma^2$. 

In order to clarify the above ideas we shall represent the series of 
samples by symbols as follows 

Sample    Variates                                                  Mean 
1         $X_{11}$   $X_{12}$   $X_{13}$   $\cdots$   $X_{1n}$      $\bar{X}_1$ 
2         $X_{21}$   $X_{22}$   $X_{23}$   $\cdots$   $X_{2n}$      $\bar{X}_2$ 
i         $X_{i1}$   $X_{i2}$   $X_{i3}$   $\cdots$   $X_{in}$      $\bar{X}_i$ 
k         $X_{k1}$   $X_{k2}$   $X_{k3}$   $\cdots$   $X_{kn}$      $\bar{X}_k$ 

where there are k samples of n variates each. To give the symbols a 
concrete meaning it can be supposed that kn rats are divided at random 
into k samples. Then $X_{11}$ is the weight of the first rat in the first sample. 
Also, $X_{ij}$ is the weight of the jth rat in the ith sample. 

For each sample we calculate 

$$V_i = \frac{\sum_{j=1}^{n} x_{ij}^2}{n - 1} \qquad (2)$$

where $x_{ij}$ represents a single deviation $(X_{ij} - \bar{X}_i)$ from the sample mean. 
Since each of these is an unbiased estimate of $\sigma^2$, the series can be averaged 
to provide a single estimate. Thus 

$$V = \frac{\sum_{i=1}^{k} V_i}{k} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n} x_{ij}^2}{k(n - 1)} \qquad (3)$$

which amounts to summing all the sums of squares of deviations from 
sample means and dividing by the total number of degrees of freedom 
available within samples. 
To find $nV_{\bar{x}}$ we take 

$$nV_{\bar{x}} = \frac{n \sum_{i=1}^{k} (\bar{X}_i - \bar{X})^2}{k - 1} \qquad (4)$$

where $\bar{X}$ is the general mean, or average of all sample means. 

If the variates in all samples are drawn at random, both V and $nV_{\bar{x}}$ 
are unbiased estimates of $\sigma^2$, but they are of course not equally reliable 
since V is based on k(n - 1) degrees of freedom and $nV_{\bar{x}}$ on only k - 1 
degrees of freedom. The random arrangement ensures that any one rat 
has an equal chance of being included in any sample. It should be clear, 
therefore, that the only factor affecting the values of both V and $nV_{\bar{x}}$ is 
the variability of the population from which the samples are drawn. 
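
The two estimates can be sketched as a single routine for k samples of n variates each. The function name `mean_squares` and the small set of figures used to exercise it are hypothetical and chosen only so that the arithmetic can be followed by hand.

```python
def mean_squares(samples):
    """Return (V, nV_means) for k samples of equal size n.

    V is the pooled mean square within samples, on k(n - 1) degrees of freedom.
    nV_means is n times the mean square of the sample means, on k - 1 degrees of freedom.
    """
    k = len(samples)
    n = len(samples[0])
    means = [sum(s) / n for s in samples]
    general_mean = sum(means) / k
    within_ss = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    v_within = within_ss / (k * (n - 1))
    n_v_means = n * sum((m - general_mean) ** 2 for m in means) / (k - 1)
    return v_within, n_v_means

# Hypothetical figures: 3 samples of 3 variates.
demo = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(mean_squares(demo))
```

In this made-up set each sample has a sum of squares of 2 about its own mean, so V = 6/6 = 1, while the means 2, 3, 4 give a mean square of 1 and hence 3 for n times that quantity.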

It will be of interest to apply this proposition to actual figures. The 
figures below have been placed at random into 5 groups of 5. 

Group    Variates                    Mean 
1        [illegible in source]       24.8 
2        44   38   48   31   31      38.4 
3        16   29   29   18   46      27.6 
4        [illegible in source]       30.2 
5        [illegible in source]       25.8 

Calculating V from formula (3) but applying short cuts explained in a later 
section, we get 

$$V = \frac{3006.80}{20} = 150.34$$

Then 

$$nV_{\bar{x}} = 148.74$$

The agreement here is better than would ordinarily be expected. The 
purpose of the calculations is merely to show the meaning of the algebraic 
symbols in terms of actual figures. In a large series of such samples we 
would, however, expect to get very close agreement between V and $nV_{\bar{x}}$. 

Our chief interest in the proposition outlined above lies in the situation 
that is created when the population is not homogeneous, as in the example 
mentioned in Section 1 where the variance due to the means is affected 
by fundamental differences between the groups. With groups of rats, 
each group may have been given a different ration. The conditions 
causing variation within the groups will be due to the original variation 
in the population, but the means of the samples will vary additionally 
owing to the differences in the rations. With symbols the situation can 
be represented as follows. 

Sample    Variates                                                           Sample Mean           Mean Square 
1         $X_{11} + Y_1$   $X_{12} + Y_1$   $\cdots$   $X_{1n} + Y_1$        $\bar{X}_1 + Y_1$     $V_1$ 
2         $X_{21} + Y_2$   $X_{22} + Y_2$   $\cdots$   $X_{2n} + Y_2$        $\bar{X}_2 + Y_2$     $V_2$ 
k         $X_{k1} + Y_k$   $X_{k2} + Y_k$   $\cdots$   $X_{kn} + Y_k$        $\bar{X}_k + Y_k$     $V_k$ 

Note that in sample 1 each variate value consists of a part X, as in the 
case where all variates are drawn at random, plus a portion $Y_1$ which is 
constant for each variate. Thus the mean of sample 1 is $\sum X_{1j}/n + nY_1/n 
= \bar{X}_1 + Y_1$. Since Y is constant for any one sample, it does not affect 
the value of $V_1$. Algebraically, we have for sample 1 

$$V_1 = \frac{\sum_{j=1}^{n} [(X_{1j} + Y_1) - (\bar{X}_1 + Y_1)]^2}{n - 1} = \frac{\sum_{j=1}^{n} (X_{1j} - \bar{X}_1)^2}{n - 1} = \frac{\sum_{j=1}^{n} x_{1j}^2}{n - 1} \qquad (5)$$

Therefore V based on all samples will be the same as if the Y factor did 
not exist. 

Examination of the new mean square for sample means shows that it 
is definitely affected by differences among the values of Y. We shall 
represent it by $V'_{\bar{x}}$. 

$$V'_{\bar{x}} = \frac{\sum_{i=1}^{k} [(\bar{X}_i + Y_i) - (\bar{X} + \bar{Y})]^2}{k - 1} = \frac{\sum (\bar{X}_i - \bar{X})^2}{k - 1} + \frac{\sum (Y_i - \bar{Y})^2}{k - 1} + \frac{2\sum (\bar{X}_i - \bar{X})(Y_i - \bar{Y})}{k - 1} \qquad (6)$$

$V'_{\bar{x}}$ consists now of $V_{\bar{x}} + V_Y$ plus the additional quantity which is a 
sum of products of deviations. Now the part of the differences between 
the sample means and the general mean represented by $(\bar{X}_i - \bar{X})$ arises 
purely from random sampling and would bear no relation to corresponding 
values of $(Y_i - \bar{Y})$. Therefore, we can assume that for a large series of 
samples this sum of products will be zero. We are left then with 

$$V'_{\bar{x}} = V_{\bar{x}} + V_Y \qquad (7)$$

or 

$$nV'_{\bar{x}} = nV_{\bar{x}} + nV_Y$$

and conclude that for such a series of samples $nV'_{\bar{x}}$ will tend to be greater 
than $nV_{\bar{x}}$ by an amount equal to $nV_Y$. 

The numerical example above can be adjusted to show how the algebraic 
statement works out. For example, we shall take $Y_1 = 0$, 
$Y_2 = +2$, $Y_3 = 0$, $Y_4 = 0$, and $Y_5 = -2$. Then the figures are 


Group Variates Mean 
1 2911437 14- “237 СЫ 24.8 
2 46 40 50 33 33 40.4 
3 16 29 29 18 46 27.6 
4 <y ЄНЇ! 317 72827-44 30.2 
5 АБ FIO: S17) “18-26 23.8 
We get 

$$V = \frac{3006.80}{20} = 150.34 \text{ as before}$$

$$nV'_{\bar{x}} = \frac{886.97}{4} = 221.74$$

The latter is considerably inflated over the previous figure of 148.74, 
showing the effect of Y. 
In an actual experiment we do not know the value of $V_y$ because we 
do not have a set of figures showing the random variation only. All we 
have are the final figures containing both X and Y. We know, however, 
that $V$ is a good estimate of $nV_{\bar{x}}$; therefore from (7) 

$$nV'_{\bar{x}} - V \qquad (8)$$

gives a good estimate of the portion of the total variance due to means 
that can be attributed to the Y effects arising from the treatments applied 
to the samples. Here we have 

221.74 − 150.34 = 71.40 

This procedure is known as estimating the variance component due to 
treatments and represents one of the important functions of the analysis 
of variance. 
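
The estimate of the treatment component can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `variance_component` and the data are hypothetical (they are not the figures of the example above), and all groups are assumed to be of equal size. It shows that adding a constant Y to every variate of a group leaves V unchanged while inflating n times the mean square of the group means.

```python
from statistics import mean, variance


def variance_component(groups):
    """Return (V, n*V'_xbar, n*V'_xbar - V) for k groups of n variates each."""
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("all groups must have the same size")
    # V: pooled within-group mean square, k(n - 1) degrees of freedom.
    v_within = mean(variance(g) for g in groups)
    # n * V'_xbar: n times the mean square of the group means, k - 1 df.
    n_v_means = n * variance([mean(g) for g in groups])
    return v_within, n_v_means, n_v_means - v_within


# Hypothetical random parts X for 3 groups of 4 variates.
base = [[12, 15, 11, 14], [14, 11, 17, 14], [10, 13, 11, 14]]
# Hypothetical treatment contributions Y, constant within each group.
shifts = [0, 3, -3]
treated = [[x + y for x in g] for g, y in zip(base, shifts)]

v_base, nv_base, _ = variance_component(base)
v_treated, nv_treated, component = variance_component(treated)
print(v_base, v_treated)    # the within-group mean square is unaffected by Y
print(nv_base, nv_treated)  # the mean square from the means is inflated
print(component)            # estimate of n * V_y
```

The pooled mean square is taken as the plain average of the group variances, which is valid only because the groups are of equal size.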
The same general principles hold when the data are classified in two 
ways, as in a simple replicated experiment with a series of treatments. 
The data may be arranged in a table as follows. 


                             Treatment 
                 1           2          ...    n           Mean 
             1   $X_{11}$    $X_{12}$   ...    $X_{1n}$    $\bar{R}_1$ 
Replicate    2   $X_{21}$    $X_{22}$   ...    $X_{2n}$    $\bar{R}_2$ 
             ...
             r   $X_{r1}$    $X_{r2}$   ...    $X_{rn}$    $\bar{R}_r$ 
Mean             $\bar{T}_1$ $\bar{T}_2$ ...   $\bar{T}_n$ $M$ 


Here $X_{11}$ represents the yield of treatment 1 in replicate 1. This table 
may be regarded as made up of r samples of n each, or n samples of r each. 
Thus, if the variates were all drawn at random from a population with 
variance $\sigma^2$ and simply arranged as above in rows and columns, we could 
make an estimate of this variance in three ways: (1) from the means of 
the rows, (2) from the means of the columns, and (3) from the residual 
mean square not affected by rows and columns. 

In order to clarify this point it is convenient to represent each yield by 
$X + Y + Z$ instead of by $X$ alone. Then $Y$ is the portion representing 
the replicates and $Z$ the portion representing the treatments. 

The same reasoning as was applied to the example of $k$ samples of $n$ 
each will then show that an estimate of the population variance from the 
means of the replicates gives 

$$nV'_{\bar{r}} = nV_{\bar{r}} + nV_y \qquad (9)$$

and from the means of the treatments 

$$rV'_{\bar{t}} = rV_{\bar{t}} + rV_z \qquad (10)$$

In (9) $nV_y$ is that portion of the mean square for replicates that is due to 
real variation among the replicates and not to random variation from 
plot to plot. Also in (10) $rV_z$ is the component of the mean square due 
only to treatments. 

When each variate is represented by $(X + Y + Z)$, where $Y_1$ will be 
the contribution due to replicate 1 and $Z_1$ the contribution due to 
treatment 1, the first variate can be represented by $(X_{11} + Y_1 + Z_1)$. Then, 
in order to obtain an estimate of random variation only it should be 
noted that 

$$(X_{11} + Y_1 + Z_1 - M) - [(\bar{X}_{r1} + Y_1 + \bar{Z}) - M] - [(\bar{X}_{t1} + \bar{Y} + Z_1) - M] \qquad (11)$$

where $M$ is the general mean, $\bar{X}_{r1}$ the mean of the replicate containing 
$X_{11}$, and $\bar{X}_{t1}$ the mean of the treatment containing $X_{11}$, can be simplified to 

$$(X_{11} - \bar{X}) - (\bar{X}_{r1} - \bar{X}) - (\bar{X}_{t1} - \bar{X}) \qquad (12)$$

since $M = \bar{X} + \bar{Y} + \bar{Z}$. This shows that, for an individual deviation 
from the general mean, if the deviations due to replicate and treatment 
means are subtracted, the remaining portion represents random variation 
only. It shows how a mean square can be calculated that will be an 
estimate of $\sigma^2$ in that all effects due to replicates and treatments are 
removed. The method of making the actual calculations will be demonstrated 
in later sections. 
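
The removal of replicate and treatment effects can be checked numerically. The following Python sketch is an illustration only: the function name `residuals` and all figures are hypothetical. It builds a table of yields X + Y + Z from assumed parts and confirms that the residual left after subtracting the replicate and treatment deviations is the same as that obtained from the random parts X alone.

```python
def residuals(table):
    """table[i][j] is the yield of treatment j in replicate i.

    Returns each variate minus its replicate mean, minus its treatment
    mean, plus the general mean.
    """
    r, n = len(table), len(table[0])
    grand = sum(map(sum, table)) / (r * n)
    rep_means = [sum(row) / n for row in table]
    trt_means = [sum(table[i][j] for i in range(r)) / r for j in range(n)]
    return [[table[i][j] - rep_means[i] - trt_means[j] + grand
             for j in range(n)] for i in range(r)]


# Hypothetical random parts X (2 replicates, 3 treatments).
x = [[3.0, -1.0, 2.0], [0.0, 4.0, -2.0]]
# Hypothetical replicate contributions Y and treatment contributions Z.
y = [5.0, -5.0]
z = [10.0, 20.0, 30.0]
full = [[x[i][j] + y[i] + z[j] for j in range(3)] for i in range(2)]

res_full = residuals(full)
res_x = residuals(x)
print(res_full)
print(res_x)  # agrees with res_full up to rounding error
```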

3. The test of significance. In Section 2 it was noted that, owing to 
treatment effects, the mean square $nV'_{\bar{x}}$ estimated from the treatment or 
sample means tends to be larger than $V$, the estimate of the variance 
owing to random sampling variations. $V$ is referred to as the error 
variance and $nV'_{\bar{x}}$ as the treatment variance. 

A situation of this kind calls for our being able to make some sort of 
probability statement with respect to the result obtained. This would 
be possible if the theoretical distribution of such a ratio as $nV'_{\bar{x}}/V$ 
could be determined. This problem was solved when R. A. Fisher [12] 
gave the required theoretical distribution of $z = \frac{1}{2}\log_e (V_1/V_2)$, where 
$V_1$ and $V_2$ are independent estimates of the same variance and are based 
on $n_1$ and $n_2$ degrees of freedom respectively. Here we are concerned 
with $nV'_{\bar{x}}$ and $V$, the first based on $k - 1$ and the latter on $k(n - 1)$ 
degrees of freedom. The ratio is 

$$\frac{nV'_{\bar{x}}}{V} = \frac{nV_{\bar{x}} + nV_y}{V} \qquad (13)$$

and since $nV_{\bar{x}}$ tends to be equal to $V$, the ratio amounts to* 

$$\frac{nV'_{\bar{x}}}{V} = 1 + \frac{nV_y}{V} \qquad (14)$$

The test of significance in effect determines, therefore, the extent to which 
$nV_y/V$ differs from zero. 


* The mean value of a ratio of two mean squares is actually $n_2/(n_2 - 2)$ and is 
therefore always slightly greater than 1. It is close enough to unity in the general case to 
enable us to say that (14) is approximately true. 

Complete tables for making tests of significance would be very cumbersome 
as there is a different distribution for each pair of values of $n_1$ and 
$n_2$. This difficulty is overcome in the tables prepared by Fisher [14] and 
Fisher and Yates [16] by means of a series of selected values of $n_1$ and $n_2$ 
and a complete table for each probability level. Snedecor [23] gives 
Fisher's tables in the form of $V_1/V_2$, the variance ratio, to which he gave 
the symbol F. Fisher and Yates [16] give tables for both z and F. The 
probability levels selected are those in which we are ordinarily interested 
in making tests of significance, such as 0.01, 0.05, and 0.10. 

In the numerical example given above 

$$F = \frac{nV'_{\bar{x}}}{V} = \frac{221.74}{150.34} = 1.47$$

The mean square in the numerator is estimated from 4 degrees of freedom 
and that in the denominator from 20 degrees of freedom. Looking up 
Table A-6 and entering under $n_1 = 4$ and $n_2 = 20$, we find that the value 
of F at the 5% point is 2.87. It would be concluded that the variance 
component $V_y$ is not significantly greater than zero. 

4. Orthogonality. We have seen from Section 2 that variation among 
the treatment means represented by $\bar{X}_i + Y_i$ does not have any effect on 
$V$, the mean square arising from variation among the variates but within 
samples. When this is the relation between two mean squares, we say 
that they are orthogonal. Orthogonality is a mathematical term that has 
very much the same meaning as independence. Two effects in an experiment 
that can be estimated separately and are not entangled in any way are 
orthogonal to each other. Strictly speaking, it cannot always be said 
that two effects are orthogonal without making a mathematical analysis, 
but there are certain general rules that will not lead us astray unless they 
are applied without careful thought to complex situations. This can be 
made clear by means of a simple demonstration. An experiment consists 
of 4 treatments in 3 replications, and after operating the experiment we 
are able to make a table of results as follows. 


                           Treatment 
                  A          B          C          D          Replicate Mean 
             1    $X_{A1}$   $X_{B1}$   $X_{C1}$   $X_{D1}$   $\bar{R}_1$ 
Replicate    2    $X_{A2}$   $X_{B2}$   $X_{C2}$   $X_{D2}$   $\bar{R}_2$ 
             3    $X_{A3}$   $X_{B3}$   $X_{C3}$   $X_{D3}$   $\bar{R}_3$ 
Variety mean      $\bar{T}_A$ $\bar{T}_B$ $\bar{T}_C$ $\bar{T}_D$ $G$ 

On examining this table it is clear that one plot of each variety occurs 
in each replicate. Now suppose that treatment A gives much higher 
yields than the others. This will have no effect on the variation among 
the replicates because treatment A occurs in each. Similarly, if replicate 1 
gives much higher yields than the others, this will not affect variation 
among the treatments. We can say quite safely that mean squares 
representing treatments and replicates are therefore orthogonal. Picturing 
the opposite situation, suppose that A is much the highest yielding treatment 
but the plot of A in replicate 1 is missing. This would tend to 
decrease $\bar{R}_1$ as compared to $\bar{R}_2$ and $\bar{R}_3$, and the treatment and replicate 
effects as represented by $\bar{T}_A$, $\bar{T}_B$, $\bar{T}_C$, $\bar{T}_D$, and $\bar{R}_1$, $\bar{R}_2$, $\bar{R}_3$ would no longer 
be orthogonal. 

5. Partitioning of sums of squares and degrees of freedom. Suppose 
that we have a table of results as follows. 


                                Treatment 
                   1          2          3          ...   n          Replicate Total 
             1     $X_{11}$   $X_{12}$   $X_{13}$   ...   $X_{1n}$   $R_1$ 
Replicate    2     $X_{21}$   $X_{22}$   $X_{23}$   ...   $X_{2n}$   $R_2$ 
             ...
             r     $X_{r1}$   $X_{r2}$   $X_{r3}$   ...   $X_{rn}$   $R_r$ 
Treatment Total    $C_1$      $C_2$      $C_3$      ...   $C_n$      $G$ = Grand Total 


For any one observation, say $X_{11}$, we can write 

$$(X_{11} - \bar{X}) = (X_{11} - \bar{R}_1) + (\bar{R}_1 - \bar{X})$$

where $\bar{X}$ is the general mean and $\bar{R}_1$ is the mean of replicate 1. Then 

$$(X_{11} - \bar{X})^2 = (X_{11} - \bar{R}_1)^2 + (\bar{R}_1 - \bar{X})^2 + 2(X_{11} - \bar{R}_1)(\bar{R}_1 - \bar{X})$$

and, summating over replicate 1, 

$$\sum_1^n (X_1 - \bar{X})^2 = \sum_1^n (X_1 - \bar{R}_1)^2 + \sum_1^n (\bar{R}_1 - \bar{X})^2 + 2(\bar{R}_1 - \bar{X})\sum_1^n (X_1 - \bar{R}_1)$$

The last term is zero because it contains $\sum_1^n (X_1 - \bar{R}_1)$, which must be zero 
because it is a sum of deviations from a mean. The second last term 
contains $(\bar{R}_1 - \bar{X})$, which is constant, and therefore in summating it is 
merely taken $n$ times. Eliminating the last term, we have 

$$\sum_1^n (X_1 - \bar{X})^2 = \sum_1^n (X_1 - \bar{R}_1)^2 + n(\bar{R}_1 - \bar{X})^2$$

Now, if we repeat this for each replicate and summate over the whole 
experiment, we get 

$$\sum_1^r \sum_1^n (X - \bar{X})^2 = \sum_1^r \sum_1^n (X - \bar{R})^2 + n\sum_1^r (\bar{R} - \bar{X})^2 \qquad (15)$$

This is a very important equation in that it shows how the total sum of 
squares can be partitioned into one part representing deviations from the 
mean within replicates and another part representing deviations of the 
replicate means from the general mean. Simply by extending the algebra 
we can show that, if a deviation of $X_{11}$ from the general mean is written as 

$$(X_{11} - \bar{X}) = (\bar{R}_1 - \bar{X}) + (\bar{C}_1 - \bar{X}) + [(X_{11} - \bar{X}) - (\bar{R}_1 - \bar{X}) - (\bar{C}_1 - \bar{X})]$$
$$= (\bar{R}_1 - \bar{X}) + (\bar{C}_1 - \bar{X}) + (X_{11} - \bar{R}_1 - \bar{C}_1 + \bar{X})$$

then the total sum of squares can be partitioned into 

$$\sum_i \sum_j (X_{ij} - \bar{X})^2 = n\sum_i (\bar{R}_i - \bar{X})^2 + r\sum_j (\bar{C}_j - \bar{X})^2 + \sum_i \sum_j (X_{ij} - \bar{R}_i - \bar{C}_j + \bar{X})^2 \qquad (16)$$

Total = Replicates + Treatments + Random variation 

where i runs from 1 to r and j from 1 to n. 

The degrees of freedom can be partitioned in a similar manner. Adjustment 
for the mean takes up 1 degree of freedom; therefore a total of 
$rn - 1$ are available for partitioning. Taking $r - 1$ degrees of freedom 
for replicates and $n - 1$ for treatments leaves 

$$(rn - 1) - (r - 1) - (n - 1) = (r - 1)(n - 1)$$

for random variations. The equation for degrees of freedom corresponding 
to (16) is therefore 

$$rn - 1 = (r - 1) + (n - 1) + (r - 1)(n - 1)$$

Total = Replicates + Treatments + Random variations 

The complete analysis of variance is most conveniently set up in tabular 
form as shown below, where we now substitute the term error for random 
variation. 


              Sum of Squares                                       Degrees of Freedom    Mean Square 
Replicates    $n\sum (\bar{R} - \bar{X})^2$                         $r - 1$               $nV'_{\bar{r}}$ 
Treatments    $r\sum (\bar{C} - \bar{X})^2$                         $n - 1$               $rV'_{\bar{t}}$ 
Error         $\sum\sum (X - \bar{R} - \bar{C} + \bar{X})^2$        $(r - 1)(n - 1)$      $V$ 

Total         $\sum\sum (X - \bar{X})^2$                            $rn - 1$ 

This form is convenient for tabulation and for making tests of significance. 
An F value for replicates or for treatments can be calculated from 

$$\frac{nV'_{\bar{r}}}{V} \quad \text{or} \quad \frac{rV'_{\bar{t}}}{V}$$

For replicates, the degrees of freedom are $(r - 1)$ and $(r - 1)(n - 1)$, and for 
treatments they are $(n - 1)$ and $(r - 1)(n - 1)$. 
The calculations are best carried out as in the formulas given below, 
where $R$ and $C$ are the replicate and treatment totals. 

$$\text{Total} \quad \sum_1^r \sum_1^n (X - \bar{X})^2 = \sum_1^r \sum_1^n X^2 - \frac{G^2}{rn} \qquad (17)$$

$$\text{Replicates} \quad n\sum_1^r (\bar{R} - \bar{X})^2 = \frac{\sum_1^r R^2}{n} - \frac{G^2}{rn} \qquad (18)$$

$$\text{Treatments} \quad r\sum_1^n (\bar{C} - \bar{X})^2 = \frac{\sum_1^n C^2}{r} - \frac{G^2}{rn} \qquad (19)$$

Error = Total − Replicates − Treatments 

The entire procedure, including tests of significance, is given in Example 
5-1. 

6. Example 5-1. Twofold classification of variates. In a swine feed- 
ing experiment, Dunlop [10] obtained the results given in Table 5-1. 
The three rations, А, B, and C, differed in the substances providing the 
vitamins. The animals were in 4 groups of 3 each, the grouping being 
on the basis of litter and initial weight. For our purpose we shall assume 
that the grouping is merely a matter of replication. 


TABLE 5-1 

GAINS IN WEIGHT OF SWINE FED ON RATIONS A, B, AND C 

                      Group 
Ration       I       II      III     IV      Total 
A            7.0     16.0    10.5    13.5     47.0 
B           14.0     15.5    15.0    21.0     65.5 
C            8.5     16.5     9.5    13.5     48.0 

Total       29.5     48.0    35.0    48.0    160.5 

The form of the analysis is 


SS DF 
Rations 2 
Groups 3 
Error 6 

Total 11 


Calculating the sums of squares, we have 


Total = 2316.75 − (160.5)²/12 = 2316.75 − 2146.69 = 170.06 

Rations = (47.0² + 65.5² + 48.0²)/4 − 2146.69 = 54.12 
Groups = (29.5² + ... + 48.0²)/3 − 2146.69 = 87.73 
Error = 170.06 − 54.12 − 87.73 = 28.21 


This gives us an analysis of variance as follows. 


            SS        DF     MS        F       5% Point 
Rations      54.12     2     27.06     5.76    5.14 
Groups       87.73     3     29.24     6.22    4.76 
Error        28.21     6      4.702 
Total       170.06    11 


The mean square for rations is just significant. The meaning of the 
significance of the mean square for groups depends on the manner in 
which the classification into groups has been made. We have assumed 
here that the groups are merely replications, in which case the error mean 
square is a result of variations within groups not due to the rations. It 
is, therefore, valid to consider this mean square as an error with which 
the others can be compared. The group mean square, since it results 
from the plan of the experiment, is an expression of error control. If the 
arrangement had been other than in groups, we would have had a simple 
classification into within and between rations. The mean square for 
within rations would have been much larger than it is according to the 
present arrangement, and consequently the experiment would have been 
less precise. 
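
The arithmetic of Example 5-1 can be reproduced with a short Python sketch of formulas (17), (18), and (19). The function name `two_way_anova` and the dictionary keys are illustrative assumptions; the figures are those of Table 5-1, with rations as rows and groups as columns.

```python
def two_way_anova(table):
    """Sums of squares and F ratios for a two-way table, one variate per cell.

    table[i][j] is the variate for row i (here a ration) and column j
    (here a group).
    """
    rows, cols = len(table), len(table[0])
    grand = sum(map(sum, table))
    ct = grand ** 2 / (rows * cols)  # correction term G^2 / rn
    total = sum(x * x for row in table for x in row) - ct
    ss_rows = sum(sum(row) ** 2 for row in table) / cols - ct
    ss_cols = sum(sum(row[j] for row in table) ** 2
                  for j in range(cols)) / rows - ct
    error = total - ss_rows - ss_cols
    ms_error = error / ((rows - 1) * (cols - 1))
    return {
        "total": total,
        "rows": ss_rows,
        "cols": ss_cols,
        "error": error,
        "F_rows": ss_rows / (rows - 1) / ms_error,
        "F_cols": ss_cols / (cols - 1) / ms_error,
    }


# Gains in weight from Table 5-1: rations A, B, C by groups I to IV.
gains = [
    [7.0, 16.0, 10.5, 13.5],
    [14.0, 15.5, 15.0, 21.0],
    [8.5, 16.5, 9.5, 13.5],
]
result = two_way_anova(gains)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

The F ratios are then compared with the tabulated 5% points for (2, 6) and (3, 6) degrees of freedom, as in the analysis of variance table above.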

7. Experimental error. In previous discussions the variates have been 
represented by symbols such as $X_{11} + Y_1 + Z_1$, where $X_{11}$ is that portion 
due entirely to random variability, and our problem in the analysis of 
variance was to isolate the mean square resulting from this random 
variability from other mean squares resulting from the effects of treatments, 
replicates, etc. This mean square due to random variability is 
commonly referred to as the mean square for experimental error. In 
other words, it represents in an experiment that portion of the total 
variability that is beyond control. If the previous discussions have been 
followed carefully, it will be obvious that the experimental error is the 
logical measuring stick for making tests of significance of other mean 
squares, because, if the mean square tested is not inflated by any real 
effect in the experiment, it must tend to be equal to the experimental error. 

In spite of the apparent simplicity of the analysis of variance as applied 
to experimental data, there is often much confusion in connection with 
the correct expression for error in tests of significance. In other words, 
it is not always clear what constitutes a valid error for a given test. 

8. Selecting a valid error. Significance is a relative and not an absolute 
term. Differences are found to be significant or insignificant in relation 
to variability arising from a source arbitrarily selected according to the 
interpretation to be put on the result. In this connection the experi- 
menter should understand clearly the concept of tests of significance as 
based on the testing of a given hypothesis. The hypothesis selected for 
test decides what is a valid error. To make these points clear let us 
assume that an experiment is being conducted involving chemical deter- 
minations on 2 kinds of material being compared. The method is to 
draw 20 samples from each material. In the laboratory each sample is 
split and-analyzed in duplicate. The form of the analysis of variance 
may be set up as follows. 

                                 DF    Expected Value of MS           MS in an Actual Experiment 
Materials (A and B)               1    $\sigma_m^2$                    $s_m^2$ 
Variability among A samples      19    $\sigma_a^2$ } $\sigma_s^2$     $s_s^2$ 
Variability among B samples      19    $\sigma_b^2$ } 
Between duplicates               40    $\sigma_d^2$                    $s_d^2$ 

Total                            79 


It may be assumed, as in most studies of this sort, that $\sigma_a^2$ and $\sigma_b^2$ are 
of the same magnitude and may be combined to form a single variance 
$\sigma_s^2$ representing the variability due to sampling. The hypothesis under 
test in this example is that the materials are identical. When this 
hypothesis is true, the expected value of $\sigma_m^2$ is $\sigma_s^2$. In other words, 
variation due to the samples will be the only factor contributing to the 
random difference between the two means. Turning to $\sigma_s^2$, if the samples 
do not contribute any variability, the expected value is equal to $\sigma_d^2$. 
Thus, in an actual experiment where samples and materials contribute to 
the total variability and where the contribution due to real differences 
between the materials is equal to $\sigma_M^2$ and that due to real differences 
between samples is equal to $\sigma_S^2$, we have 

$$\sigma_m^2 = \sigma_d^2 + \sigma_S^2 + \sigma_M^2$$

$$\sigma_s^2 = \sigma_d^2 + \sigma_S^2$$

$$\sigma_d^2 = \sigma_d^2$$

This fundamental relation can be applied to solve the problem of whether 
$s_s^2$ or $s_d^2$ should be the error term in making a test of the significance of 
$s_m^2$. The existence of a difference between A and B is perhaps important 
in some industrial process, and we will suppose that A gives a higher 
and more favorable result than B. Now, if either of these 2 materials is 
used, the results obtained will be subject to a sampling error because this 
sampling error reflects the fact that the material is not entirely uniform. 
Furthermore, a high degree of refinement in laboratory technique may 
have made $\sigma_d^2$ very small. It follows, therefore, that our estimate of 
$\sigma_d^2$ as an error would be erroneous as it does not give a measure of the 
random fluctuations to which results with either of the materials would 
be subject in actual practice. This is, of course, provided that $\sigma_s^2$ is 
actually greater than $\sigma_d^2$. Owing to the design of the experiment, the 
expected value of $\sigma_s^2$ is equal to or greater than $\sigma_d^2$, but in exceptional 
cases the estimated values obtained in an experiment may give $s_d^2 > s_s^2$. 
We would then be able to accept $s_d^2$ as a valid error. 

A question sometimes asked in connection with an experiment of this 
kind is: Why make the determinations in duplicate if the results are not 
applied in the test of significance? To answer this we should note again 
that 

$$\sigma_s^2 = \sigma_d^2 + \sigma_S^2$$

and, the smaller we make $s_d^2$, the more sensitive the experiment will be 
in testing for differences between the materials. It is important, therefore, 
to have a measure of this source of error. That portion of the error 
represented by $s_s^2$ may be beyond our control in that it represents 
variability of the material, but it is quite conceivable that $s_d^2$ may be 
considerably reduced by improvements in laboratory technique and the sensitivity 
of the test increased. As a general rule, in experiments of this type $s_s^2$ 
is larger than $s_d^2$, which is the same as saying that real variability due to 
sampling usually exists. It is a common practice to make the test of 
significance given by 

$$F = \frac{s_s^2}{s_d^2}$$

in order to decide this point. It would, of course, be impossible to make 
the test if we did not have some measure of $\sigma_d^2$. 

9. Interactions. It will have been noted that the methods of the 
analysis of variance are particularly appropriate for data that may be 
classified in two ways as into treatments and replicates. Actually the 
method can be applied to data classified simultaneously in several ways. 
For example, we may have 3 treatments each at 2 levels (e.g., 3 fertilizers 
each at 2 rates), and the whole experiment may be replicated 4 times. 
The data could then be set up in four 3 x 2 tables, 1 for each replicate, 
giving a threefold classification. Our chief interest, however, in a situation 
of this kind is in the 2 x 3 table which can be made up of the 6 treatment 
totals. Suppose that the table is as follows. 


                  Treatment 
              A       B       C 
Level   1                             $T_1$ 
        2                             $T_2$ 
                                      $G$ 


Looking at this table from the standpoint of degrees of freedom, we see 
that out of the total of 5 there is only 1 for levels and 2 for treatments. 
The other 2 degrees of freedom must represent an effect due to the treat- 
ments but not reflected in the marginal totals. It must therefore represent 
something due to the relation between the treatments and the levels. 
This effect is commonly referred to as an interaction. To demonstrate 
the characteristics of an interaction it is most convenient to examine a 
2 x 2 table. We shall assume that there are 2 treatments at 2 levels in 
3 replicates, providing a table of results as follows. 


                Treatment 
              A       B 
Level   1     4       8      12 = $T_1$ 
        2     8      12      20 = $T_2$ 
             12      20      32 = $G$ 

The general method of analysis given in equations (17), (18), and (19) 
can be followed, but the discussion can be simplified by applying 2 new 
formulas for sums of squares. 

1. Sum of squares for 2 totals such as $T_1$ and $T_2$. 

$$\frac{T_1^2 + T_2^2}{n} - \frac{(T_1 + T_2)^2}{2n} = \frac{(T_1 - T_2)^2}{2n} \qquad (20)$$

where $n$ is the number of individual determinations in each total. 

2. Sum of squares for the interaction in a 2 × 2 table. Let the sub-totals 
in the table be represented in the following manner. 

          A           B 
1      $t_{a1}$    $t_{b1}$ 
2      $t_{a2}$    $t_{b2}$ 

The interaction sum of squares is given by 

$$\frac{[(t_{a1} + t_{b2}) - (t_{a2} + t_{b1})]^2}{2n} \qquad (21)$$

where the numerator is merely the square of the difference between the 
diagonal totals, and $n$ is, as in (20), the number of determinations in each total. 
In the example we have 

$$\text{Treatments (SS)} = \frac{(12 - 20)^2}{12} = 5.33$$

$$\text{Levels (SS)} = \frac{(12 - 20)^2}{12} = 5.33$$

$$\text{Interaction (SS)} = \frac{(16 - 16)^2}{12} = 0$$

and on examining the nature of the sub-totals it is obvious that the interaction 
sum of squares must be zero if the responses of the treatments are 
the same for both levels. In terms of symbols 

          A          B 
1         a          b 
2       a + x      b + x 

where x is the increase in response due to level 2 as compared to level 1. 
It follows that $(a + b + x) - (a + x + b) = 0$, and the interaction sum 
of squares = 0. 


Considering the responses in two directions, we could represent the 
table algebraically by 

          A            B 
1         a          a + x 
2       a + y      a + x + y 

where x represents the response due to B as compared to A, and y the 
response of 2 as compared to 1. Then 

$$a + (a + x + y) - [(a + y) + (a + x)] = 0$$

and the interaction sum of squares = 0. 
Let us examine the situation when the responses are not equal. For 
example, we have the results 

        A      B 
1       6      4 
2       8     12 

in which B gives a much greater response to the increased quantity of 
treatment than A. Here we have 

$$\text{Treatments (SS)} = \frac{(14 - 16)^2}{12} = 0.33$$

$$\text{Levels (SS)} = \frac{(10 - 20)^2}{12} = 8.33$$

$$\text{Interaction (SS)} = \frac{(18 - 12)^2}{12} = 3.00$$

and the interaction is appreciable. 

It is of interest to note that the additive type of response is the only 
one that gives a zero interaction. A proportional response does not. 
For example, when the relation is 

          A            B 
1         a            b 
2      a + a/x      b + b/x 

the interaction sum of squares = $[(b - a)/x]^2/2n$, from which it is obvious 
that the interaction will not be zero unless a = b. However, if the values 
are transformed to logarithms, it can be shown easily that the interaction 
becomes zero. 
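
The behavior of the interaction sum of squares under additive and proportional responses can be verified with a small Python sketch. The function name `interaction_ss` and the proportional figures (2, 4, 6, 12) are hypothetical; the other figures are the sub-totals of the 2 × 2 tables in this section, for which each diagonal total contains 6 determinations.

```python
import math


def interaction_ss(t_a1, t_b1, t_a2, t_b2, n):
    """Interaction SS of a 2 x 2 table from the difference of diagonal totals.

    n is the number of individual determinations in each diagonal total.
    """
    return ((t_a1 + t_b2) - (t_a2 + t_b1)) ** 2 / (2 * n)


# Additive responses (first table of this section): no interaction.
additive = interaction_ss(4, 8, 8, 12, n=6)

# Unequal responses (second table): appreciable interaction.
unequal = interaction_ss(6, 4, 8, 12, n=6)

# Hypothetical proportional response: level 2 is three times level 1.
proportional = interaction_ss(2, 4, 6, 12, n=6)

# The same proportional figures after transformation to logarithms.
logged = interaction_ss(*(math.log(v) for v in (2, 4, 6, 12)), n=6)

print(additive, unequal, proportional, logged)
```

On the logarithmic scale a proportional response becomes a constant addition, which is why the last value vanishes apart from rounding error.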

10. Interactions of higher order. As more factors in an experiment are 
introduced, interactions of a higher order result. For example, in 
addition to the 2 treatments at 2 levels, each set may be applied at 2 
different dates. With replicates we would then have a fourfold classification. 
The treatment totals may be set up in three tables. 


              Date 1          Date 2        Dates 1 and 2 
              A    B          A    B           A    B 
Level   1 
        2 


For each date we have 

                           DF 
Treatments                  1 
Levels                      1 
2-factor interaction        1 

and, taking into consideration the dates, there will be 1 degree of freedom 
for the dates and 1 each for the interaction of the above 3 effects with 
dates, making a complete allotment of the 7 degrees of freedom as 
follows. 

                                                      DF 
Dates                                                  1 
Treatments                                             1 
Levels                                                 1 
2-factor interaction (treatments × levels)             1 
2-factor interaction (dates × treatments)              1 
2-factor interaction (dates × levels)                  1 
3-factor interaction (dates × treatments × levels)     1 


The meaning of a 3-factor interaction is most easily interpreted in 
relation to simple interactions. If the 2-factor interactions (treatments x 
levels) are equal for the 2 dates, the 3-factor interaction = 0. This can 
be proved easily by inserting figures and calculating sums of squares. 
The 3-factor interaction measures, therefore, the extent to which a 2-factor 
interaction for any 2 factors varies with the third factor. 

A 3-factor interaction for 2 × 2 tables may be calculated by a simple 
extension of the geometrical method used for the simple interaction. In 
the diagram below, the dotted lines show the method of obtaining the 
necessary totals. 


             Date I                 Date II 
           A          B           A           B 
1       $IA_1$     $IB_1$      $IIA_1$     $IIB_1$ 
2       $IA_2$     $IB_2$      $IIA_2$     $IIB_2$ 

[Diagram: dotted lines join the four cells entering $T_1$ and the four cells entering $T_2$.] 


In terms of symbols 

$$T_1 = IA_1 + IB_2 + IIA_2 + IIB_1$$

$$T_2 = IA_2 + IB_1 + IIA_1 + IIB_2$$

Then the sum of squares is 

$$\frac{(T_1 - T_2)^2}{2n}$$

where n is again the number of individual variates entering into each 
total, such as $T_1$. 

Having seen how 3-factor interactions can be obtained from threefold 
classifications, it follows that with a fourth factor we will have a 4-factor 
interaction, and so on for each factor added. It is clear, however, that 
high-order interactions, at least any of an order higher than 
the 3-factor, are not of much practical significance. In the first place they 
are rarely significant statistically, and in the second place they are too 
complex for any practical interpretation. It will be seen, as we progress 
with the study of experimental design, that it is frequently desirable to 
sacrifice high-order interactions in the design of the experiment in order 
to obtain more accuracy for the study of main effects and simple 
interactions. 

11. Methods of calculating sums of squares. After mastering the 
general principles of the analysis of variance, the student requires a 
knowledge of methods of calculating sums of squares. The following 
are some of the methods that give the sums of squares for most experiments 
in a systematic and compact form. 

a. Total for a set of n variates, $X_1, X_2, X_3, \ldots, X_i, \ldots, X_n$. 

$$\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i^2) - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n} \qquad (22)$$

The term $\sum_{i=1}^{n} (X_i^2)$ is known as the raw sum of squares, and the term 
$\left(\sum_{i=1}^{n} X_i\right)^2 / n$ is frequently referred to as the correction term (abbreviated 
C.T.). Since $\sum_{i=1}^{n} (X_i)$ is merely the grand total, it is convenient to refer to 
it as G. 
b. For a set of k groups where each group is made up of n variates. 

$$\frac{\sum_{1}^{k} \left(\sum_{1}^{n} X\right)^2}{n} - \frac{G^2}{kn} \qquad (23)$$



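c. For a set of k groups where the number of variates is not the same for 
each group. If the series of totals and the numbers in each are represented 
as follows, 

Total     $\sum X_1$   $\sum X_2$   $\sum X_3$   ...   $\sum X_i$   ...   $\sum X_k$    Grand total = G 
Number    $n_1$        $n_2$        $n_3$        ...   $n_i$        ...   $n_k$         Grand total = N 

the sum of squares is given by 

$$\frac{(\sum X_1)^2}{n_1} + \frac{(\sum X_2)^2}{n_2} + \cdots + \frac{(\sum X_i)^2}{n_i} + \cdots + \frac{(\sum X_k)^2}{n_k} - \frac{G^2}{N} \qquad (24)$$
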


Notice that this is a general formula and reduces to the one given in 
(b) above if the numbers in each total are equal. 
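
A minimal Python sketch of method (c) follows. The function name `between_groups_ss` and the figures are hypothetical. It checks the formula against the direct definition, the sum over groups of the group size times the squared deviation of the group mean from the general mean, and so covers the equal-number case of method (b) as well.

```python
def between_groups_ss(groups):
    """Sum of squares among group totals; group sizes may differ."""
    totals = [sum(g) for g in groups]
    sizes = [len(g) for g in groups]
    grand, number = sum(totals), sum(sizes)
    return sum(t * t / m for t, m in zip(totals, sizes)) - grand ** 2 / number


def between_groups_direct(groups):
    """The same quantity from deviations of group means (for checking)."""
    number = sum(len(g) for g in groups)
    general_mean = sum(map(sum, groups)) / number
    return sum(len(g) * (sum(g) / len(g) - general_mean) ** 2 for g in groups)


# Hypothetical groups of unequal size.
unequal = [[2, 4], [1, 3, 5, 7], [6]]
# Hypothetical groups of equal size, the case of method (b).
equal = [[2, 4, 6], [1, 3, 5], [7, 8, 9]]

print(between_groups_ss(unequal), between_groups_direct(unequal))
print(between_groups_ss(equal), between_groups_direct(equal))
```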




d. In (2 × n)-fold tables. Suppose that the table is as follows. 


                         Treatment 
            1             2             3             ...   n             Total 
Level I     $X_1$         $X_2$         $X_3$         ...   $X_n$         $\sum X$ 
      II    $Y_1$         $Y_2$         $Y_3$         ...   $Y_n$         $\sum Y$ 
Total       $X_1 + Y_1$   $X_2 + Y_2$   $X_3 + Y_3$   ...   $X_n + Y_n$   $G$ 


Setting out the sums of squares required, we have 

                DF 
Treatments      $n - 1$ 
Levels          1 
Interaction     $n - 1$ 
Total           $2n - 1$ 

Instead of 2 levels we may have I and II representing replicates, in 
which case the third term will be an error. 
The total sum of squares is obtained from 

$$\sum X^2 + \sum Y^2 - \frac{G^2}{2n} \qquad (25)$$

which is merely an application of (a) above. 
For levels 

$$\frac{(\sum X - \sum Y)^2}{2n} \qquad (26)$$

For treatments 

$$\frac{\sum (X + Y)^2}{2} - \frac{G^2}{2n} \qquad (27)$$

Then for the interaction or error term we can follow two courses. 
First, we can subtract treatments and levels from the total, obtaining the 
error as a difference. Second, we can calculate the third term directly, 
from 

$$\frac{\sum (X - Y)^2}{2} - \frac{(\sum X - \sum Y)^2}{2n} \qquad (28)$$

where the term on the right is the same as in (26). 

It should be noted that these formulas are based on there being a single 
variate in each cell. If the above experiment had been operated with r 
replicates, r would have appeared in the denominator of each term. 
Thus the last formula would have been 

$$\frac{\sum (X - Y)^2}{2r} - \frac{(\sum X - \sum Y)^2}{2rn} \qquad (29)$$

Also the total sum of squares given above is the total for $2n - 1$ degrees 
of freedom in the table. The grand total sum of squares will be 

$$\sum (X^2) + \sum (Y^2) - \frac{G^2}{2rn} \qquad (30)$$

where X and Y here are individual variates and not sub-totals. 

e. In an (m × n)-fold table. Suppose that there are n columns and 
m rows in the table. 


                      Column 
            1            2            ...   n 
       1    $X_{11}$     $X_{12}$     ...   $X_{1n}$     $T_{r1}$ 
       2    $X_{21}$     $X_{22}$     ...   $X_{2n}$     $T_{r2}$ 
Row    ...
       m    $X_{m1}$     $X_{m2}$     ...   $X_{mn}$     $T_{rm}$ 
            $T_{c1}$     $T_{c2}$     ...   $T_{cn}$     $G$ 


Setting up the outline of sums of squares and degrees of freedom, we have 

                        SS    DF 
Columns                       $n - 1$ 
Rows                          $m - 1$ 
Interaction (C × R)           $(n - 1)(m - 1)$ 
Total                         $mn - 1$ 

Then 

$$\text{Total sum of squares} = \sum X^2 - \frac{G^2}{mn}$$

$$\text{Columns} = \frac{\sum (T_c^2)}{m} - \frac{G^2}{mn}$$

$$\text{Rows} = \frac{\sum (T_r^2)}{n} - \frac{G^2}{mn}$$

$$\text{Interaction} = \sum X^2 - \frac{\sum (T_c^2)}{m} - \frac{\sum (T_r^2)}{n} + \frac{G^2}{mn}$$

The last line is the same as subtracting the sums of squares for rows and 
columns from the total. If the experiment is replicated r times, r will 
appear in the denominator of all the above expressions. Also the total 
sum of squares given above will be the total for the sub-totals of the table 
representing $mn - 1$ degrees of freedom. The grand total sum of squares 
will be 

$$\sum X^2 - \frac{G^2}{rmn}$$

where X here is an individual variate and not a sub-total. 

f. Three-factor interactions. Assuming that there are 3 factors A, B, 
and C each at more than 1 level, a general method for calculating 3-factor 
interactions is to obtain the 2-factor interaction for any 2 factors, for 
each level of the third. The sum of these less the 2-factor interaction for 
all levels of the third factor combined gives the 3-factor interaction. Thus 
there are always three tables to be formed. For example, if A, B, and C 
are each at 2 levels, we have 


            $C_1$                 $C_2$               $C_1 + C_2$ 
         $A_1$   $A_2$         $A_1$   $A_2$         $A_1$   $A_2$ 
$B_1$                  $B_1$                 $B_1$ 
$B_2$                  $B_2$                 $B_2$ 


A × B × C = (A × B) for $C_1$ + (A × B) for $C_2$ − (A × B) for ($C_1 + C_2$) 


In certain cases there are simplifications. For example, in a 2 × 2 × 2 
experiment the 3-factor interaction can be determined directly by obtaining 
2 totals from the diagonal arrangement described in Section 10. The 
procedure is repeated here for the convenience of having all calculation 
methods in one place. If the totals are $T_1$ and $T_2$, the sum of squares is 

$$\frac{(T_1 - T_2)^2}{2n}$$

where n is the number of variates entering into each total. 
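
The agreement between the general method and the direct 2 × 2 × 2 short cut can be illustrated in Python. This is a sketch under assumptions: the function names are hypothetical, each cell is taken to be a sub-total of m variates, and the two tables are borrowed from the 2 × 2 examples of Section 9 purely as illustrative figures for the two levels of the third factor.

```python
def two_factor_ss(table, n):
    """2 x 2 interaction SS; n = variates in each diagonal total."""
    (p, q), (s, t) = table
    return ((p + t) - (q + s)) ** 2 / (2 * n)


def three_factor_by_subtraction(c1, c2, m):
    """(A x B) for C1 + (A x B) for C2 - (A x B) for C1 + C2 combined."""
    combined = [[c1[i][j] + c2[i][j] for j in range(2)] for i in range(2)]
    # A diagonal total holds 2 cells in one table and 4 in the combined one.
    return (two_factor_ss(c1, 2 * m) + two_factor_ss(c2, 2 * m)
            - two_factor_ss(combined, 4 * m))


def three_factor_direct(c1, c2, m):
    """(T1 - T2)^2 / 2n, where each total holds 4 cells, so n = 4m."""
    t1 = c1[0][0] + c1[1][1] + c2[0][1] + c2[1][0]
    t2 = c1[0][1] + c1[1][0] + c2[0][0] + c2[1][1]
    return (t1 - t2) ** 2 / (2 * 4 * m)


# Illustrative sub-totals, 3 variates per cell, for the two levels of C.
level_1 = [[4, 8], [8, 12]]
level_2 = [[6, 4], [8, 12]]

by_subtraction = three_factor_by_subtraction(level_1, level_2, m=3)
direct = three_factor_direct(level_1, level_2, m=3)
print(by_subtraction, direct)
```

When the two tables have the same 2-factor interaction pattern, both functions return zero, in line with the interpretation of the 3-factor interaction given earlier.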
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There is also a simplification that can be employed in the calculation 
of 3-factor interactions for 3 x 3 x 3 experiments. Supposing that the 
factors are N, P, and K, each at 3 levels, we would have three tables as 


follows. 
Kı Ka Ks 
P P Р 
1 3 1 2 3 1 
1 111261217 131 1 112 122 132 1 113 123 
N2 211 221 231 2 212 222 232 | 2 213 223 
B 311 321 331 3 312 322 332 2 313” 323. 1333 


233 


Each table enables us to calculate the interaction N x P for 1 level of K. 
Note first that a 3 x 3 table can be divided into 2 sets of diagonal 


totals as in the squares below. 


3 x 3 Table 
11 12 13 
21 22. £23 
31 32 93 


1 Diagonals 
11 + 22 + 33 = д 
21 4 32 + 13 = h 
31 + 12 + 23 — 1 


J Diagonals 
11 + 32 +23 = J 
21 + 12 + 33 = ول‎ 
31 + 22 + 13 = و‎ 


The 4 degrees of freedom for interaction in a 3 x 3 table can be
partitioned into 2 for the I totals and 2 for the J totals. The sums of
squares would be

$$\text{I (totals)} = \frac{I_1^2 + I_2^2 + I_3^2}{3n} - \frac{G^2}{9n}$$

$$\text{J (totals)} = \frac{J_1^2 + J_2^2 + J_3^2}{3n} - \frac{G^2}{9n}$$

where G is the grand total and each sub-total in the square consists of n
individual determinations.

From the three tables of data we can write down first a set of I
totals for each. For example,

111 + 221 + 331 = A1     112 + 222 + 332 = B1     113 + 223 + 333 = C1
211 + 321 + 131 = A2     212 + 322 + 132 = B2     213 + 323 + 133 = C2
311 + 121 + 231 = A3     312 + 122 + 232 = B3     313 + 123 + 233 = C3


From the I totals an I set and a J set can be written. Thus


       I Set                     J Set
A1 + B2 + C3 = I1         A1 + B3 + C2 = J1
A2 + B3 + C1 = I2         A2 + B1 + C3 = J2
A3 + B1 + C2 = I3         A3 + B2 + C1 = J3


Starting again from the beginning, we first write down the J totals from
each table.


111 + 231 + 321 = A'1    112 + 232 + 322 = B'1    113 + 233 + 323 = C'1
211 + 121 + 331 = A'2    212 + 122 + 332 = B'2    213 + 123 + 333 = C'2
311 + 221 + 131 = A'3    312 + 222 + 132 = B'3    313 + 223 + 133 = C'3


and these will also yield an I' and a J' set.


       I' Set                       J' Set
A'1 + B'2 + C'3 = I'1        A'1 + B'3 + C'2 = J'1
A'2 + B'3 + C'1 = I'2        A'2 + B'1 + C'3 = J'2
A'3 + B'1 + C'2 = I'3        A'3 + B'2 + C'1 = J'3


Each of the above sets of 3 totals will give a sum of squares corresponding 
to 2 degrees of freedom for the 3-factor interaction, and summing these 
for the 4 sets completes the calculation. 

This principle will be employed again in the discussion of experimental
design. It is sufficient to note here that the sums of squares arising from
the 4 sets of totals are orthogonal. Therefore the 8 degrees of freedom
for the 3-factor interaction N x P x K are broken up into 4 pairs that are
mutually orthogonal.
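
The diagonal-set method for the 3 x 3 x 3 case can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes one determination per treatment combination stored in an array `cells[n, p, k]`, and the function names are hypothetical. Each of the 4 sets of totals (I, J, I', J') is identified by the diagonal pattern it follows, and the result is compared with the 3-factor sum of squares obtained as the residual after removing main effects and 2-factor interactions.

```python
import numpy as np

def npk_interaction_by_sets(cells):
    """3-factor SS from the I, J, I' and J' sets of totals.

    cells[n, p, k] holds one determination per treatment combination
    (an assumption of this sketch), with levels indexed 0, 1, 2.
    """
    cells = np.asarray(cells, dtype=float)
    n, p, k = np.indices(cells.shape)
    correction = cells.sum() ** 2 / cells.size           # G^2 / 27
    total_ss = 0.0
    # (a, b) selects the diagonal pattern n + a*p + b*k (mod 3):
    # I set, J set, I' set and J' set, in the order used in the text.
    for a, b in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
        level = (n + a * p + b * k) % 3
        totals = np.array([cells[level == t].sum() for t in range(3)])
        total_ss += (totals ** 2).sum() / 9 - correction  # 2 DF per set
    return total_ss

def npk_interaction_direct(cells):
    """Same SS, found as the residual after main effects and 2-factor terms."""
    x = np.asarray(cells, dtype=float)

    def mean_over(*axes):
        return x.mean(axis=axes, keepdims=True)

    resid = (x - mean_over(2) - mean_over(1) - mean_over(0)
             + mean_over(1, 2) + mean_over(0, 2) + mean_over(0, 1) - x.mean())
    return float((resid ** 2).sum())
```

With any 3 x 3 x 3 array of figures the two functions should agree to within rounding, which illustrates that the 4 sets together account for all 8 degrees of freedom.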

12. Partitioning of sums of squares into components corresponding to
individual degrees of freedom. One of the results of the analysis of
variance technique is that the experimenter may be inclined to assume
that the establishment of significant differences among the treatments is
a sufficient conclusion. Actually this may be only preliminary to learning
more about the differences and the characteristic effects of the treatments.
A useful technique for breaking down the sum of squares for treatments
into individual components arises from the procedure described by
Fisher [13], Yates [24], and Cochran and Cox [8] for partitioning the
sums of squares into orthogonal components, each representing a single
degree of freedom.

The components arise from what are known as linear functions of the
treatment totals. For example, with 5 treatment totals $T_1$, $T_2$, $T_3$,
$T_4$, and $T_5$, one linear function would be

$$z = T_1 + T_2 - T_3 - T_4 + (0 \times T_5) \tag{33}$$

In general

$$z = l_1 T_1 + l_2 T_2 + \cdots + l_k T_k \tag{34}$$

where there are k totals, and in such linear functions a characteristic
requirement for our purpose is that

$$l_1 + l_2 + \cdots + l_k = 0 \tag{35}$$

In equation (33), for example, 1 + 1 - 1 - 1 + 0 = 0.

For such linear functions of the treatments the corresponding sum of
squares is $z^2/D$, where

$$D = r(l_1^2 + l_2^2 + \cdots + l_k^2) \tag{36}$$

and r is the number of replications.

After writing any one such linear function representing 1 degree of
freedom, it is obvious that other similar functions must exist for the
remaining degrees of freedom. These functions can be written down by
a rule based on the following. Any 2 linear functions, such as

$$\begin{aligned} z_1 &= l_{11} T_1 + l_{12} T_2 + \cdots + l_{1k} T_k \\ z_2 &= l_{21} T_1 + l_{22} T_2 + \cdots + l_{2k} T_k \end{aligned} \tag{37}$$

are said to be orthogonal if

$$l_{11} l_{21} + l_{12} l_{22} + \cdots + l_{1k} l_{2k} = 0 \tag{38}$$

Applying this to equation (33) above, we have

$$z_1 = T_1 + T_2 - T_3 - T_4 + (0 \times T_5)$$

and

$$z_2 = T_1 - T_2 + (0 \times T_3) + (0 \times T_4) + (0 \times T_5)$$

is an equation such that the sum of the coefficients = 0, and the sum of
the products of the coefficients for $z_1$ and $z_2$ is

$$(1 \times 1) + (1 \times -1) = 0$$

the remaining products all being zero. It is thus shown that the two
comparisons are orthogonal.
Returning to equation (33), the following 4 linear functions can be
shown to be orthogonal. The coefficients are written opposite each
function for convenience in obtaining the sum of products.


(1)  T1 + T2 - 2T3                    +1  +1  -2   0   0  = 0
(2)  T4 - T5                           0   0   0  +1  -1  = 0
(3)  2(T1 + T2 + T3) - 3(T4 + T5)     +2  +2  +2  -3  -3  = 0
(4)  T1 - T2                          +1  -1   0   0   0  = 0


It will be found that the sum of products for all pairs of functions is zero;
hence the functions are mutually orthogonal.

The corresponding sums of squares are as follows, where the divisors
are given by equation (36), and it is assumed that we have 4 replicates.

$$(1)\quad \frac{z_1^2}{D_1} = \frac{(T_1 + T_2 - 2T_3)^2}{24}$$

$$(2)\quad \frac{z_2^2}{D_2} = \frac{(T_4 - T_5)^2}{8}$$

$$(3)\quad \frac{z_3^2}{D_3} = \frac{[2(T_1 + T_2 + T_3) - 3(T_4 + T_5)]^2}{120}$$

$$(4)\quad \frac{z_4^2}{D_4} = \frac{(T_1 - T_2)^2}{8}$$

Using actual figures, let

T1 = 1652
T2 = 1579
T3 =  347
T4 =  317
T5 =  151

Then

$$(1)\quad \frac{[1652 + 1579 - 2(347)]^2}{24} = 268{,}182.0$$

$$(2)\quad \frac{(317 - 151)^2}{8} = 3{,}444.5$$

$$(3)\quad \frac{[2(1652 + 1579 + 347) - 3(317 + 151)]^2}{120} = 275{,}712.5$$

$$(4)\quad \frac{(1652 - 1579)^2}{8} = 666.1$$

Total = 548,005.1

Calculating the total sum of squares for treatments in the usual way,

$$\frac{1652^2 + 1579^2 + 347^2 + 317^2 + 151^2}{4} - \frac{4046^2}{20} = 548{,}005$$

shows that the arithmetic of partitioning is correct.
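
The partition can be reproduced with a few lines of code. This is an illustrative sketch only; the names `contrast_ss` and `is_orthogonal` are hypothetical helpers, while the totals, the coefficients, and the 4 replicates are those of the worked example (the coefficients being those implied by the divisors 24, 8, 120, and 8).

```python
totals = [1652, 1579, 347, 317, 151]   # treatment totals T1..T5
r = 4                                   # replicates behind each total

# Coefficients of the 4 linear functions of the worked example.
contrasts = {
    1: [1, 1, -2, 0, 0],
    2: [0, 0, 0, 1, -1],
    3: [2, 2, 2, -3, -3],
    4: [1, -1, 0, 0, 0],
}

def contrast_ss(coeffs, totals, r):
    """Sum of squares z^2 / D for one linear function, equations (34)-(36)."""
    assert sum(coeffs) == 0, "coefficients must sum to zero, equation (35)"
    z = sum(l * t for l, t in zip(coeffs, totals))
    d = r * sum(l * l for l in coeffs)
    return z * z / d

def is_orthogonal(c1, c2):
    """Equation (38): the sum of products of the coefficients is zero."""
    return sum(a * b for a, b in zip(c1, c2)) == 0

component_ss = {k: contrast_ss(c, totals, r) for k, c in contrasts.items()}

# Treatment sum of squares computed in the usual way, for comparison.
treatment_ss = (sum(t * t for t in totals) / r
                - sum(totals) ** 2 / (r * len(totals)))
```

Because the 4 functions are mutually orthogonal and there are 4 degrees of freedom among 5 treatments, the components in `component_ss` add up to `treatment_ss`.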

Since the comparisons discussed here represent individual degrees of
freedom, it is obvious that, if the standard errors of the comparisons
based on the means can be obtained, an application of the t test will give
a measure of their significance. The general rule for calculating the
standard error is

$$s_z = s\sqrt{\frac{l_1^2}{r_1} + \frac{l_2^2}{r_2} + \cdots + \frac{l_k^2}{r_k}} \tag{39}$$

where s is the root mean square in the error line of the analysis of variance.
Suppose, for example, that the t test is to be applied to comparison (3).
Using the means gives

2(413.0 + 394.8 + 86.8) - 3(79.2 + 37.8) = 1438.2

Let

s = 54.7

Then

$$s_z = 54.7\sqrt{\frac{2^2}{4} + \frac{2^2}{4} + \frac{2^2}{4} + \frac{3^2}{4} + \frac{3^2}{4}} = 150$$

And

$$t = \frac{1438.2}{150} = 9.59$$



It should be noted that r is not always the same for all treatments as 
in a factorial experiment where there may be a control for each level of 
the treatments. 
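
The t test for comparison (3) can be sketched as follows. The variable names are illustrative, and the list `reps` is included only to show how equation (39) allows a different number of replicates for each treatment; here every treatment has 4. Note that the text rounds the standard error to 150 before dividing, so the unrounded ratio differs slightly in the second decimal place.

```python
import math

means = [413.0, 394.8, 86.8, 79.2, 37.8]   # treatment means (totals / 4)
coeffs = [2, 2, 2, -3, -3]                  # comparison (3)
reps = [4, 4, 4, 4, 4]                      # replicates per treatment
s = 54.7                                    # root of the error mean square

# Value of the comparison computed on the means.
z = sum(l * m for l, m in zip(coeffs, means))

# Equation (39): standard error of the comparison.
se = s * math.sqrt(sum(l * l / r for l, r in zip(coeffs, reps)))

t = z / se
```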

13. Example 5-2. Selecting a valid error. A series of 5 wheat varieties 
was grown at 4 stations, and baking tests were made on the flour. A 
sample of each variety was taken from each station and milled into flour. 
Two loaves were baked from each sample. The error of determination 
was given, therefore, by the differences between the volumes of the 
duplicate loaves. 

TABLE 5-2

DUPLICATE LOAF VOLUMES FOR 5 VARIETIES OF WHEAT GROWN AT
4 STATIONS (LOAF VOLUME IN cc. - 500)/10

                                   Station
Variety         I              II             III             IV         Total
   1        7.5   4.5     15.5  14.0     16.5  14.5     19.0  18.6      110.1
   2       12.5  13.2     20.0  18.5     15.0  14.0     23.8  24.4      141.4
   3        7.0   1.0     10.0   8.0     15.5  14.0     17.8  18.5       91.8
   4        1.5   2.0     13.0  15.0      8.5   9.0     14.8  16.6       80.4
   5       28.0  29.0     19.5  16.0     10.5  12.0     22.0  24.8      161.8
 Total        106.2          149.5          129.5          200.3        585.5


To determine the form that the analysis of variance will take, note first
that there must be a station mean square represented by 3 degrees of
freedom, and a variety mean square represented by 4 degrees of freedom.
There must also be an interaction effect which may be regarded as the
differential response of the varieties at the different stations. The rule
for finding the degrees of freedom for an interaction is to multiply the
degrees of freedom for the interacting factors. The interaction mean
square (MS) must therefore be represented by 3 x 4 = 12 degrees of
freedom. There is a total of 40 determinations, so there is a total of 39
degrees of freedom. The remaining 20 degrees of freedom must represent
the error of duplicate determinations. We have a check on this because
there are 20 pairs of loaves, and, since each pair gives us 1 degree of
freedom, there must be 20 in all. The final form of the analysis is:


                                        DF
Stations                                 3
Varieties                                4
Interaction (varieties x stations)      12
Error                                   20
Total                                   39

To obtain the sums of squares another table is required. This table,
shown below, gives the values of $(X_1 - X_2)$ and $(X_1 + X_2)$, where $X_1$
and $X_2$ are taken to represent the paired values.


                               Station
           Variety       I       II      III      IV      Total
              1         3.0     1.5     2.0      0.4
              2        -0.7     1.5     1.0     -0.6
X1 - X2       3         6.0     2.0     1.5     -0.7
              4        -0.5    -2.0    -0.5     -1.8
              5        -1.0     3.5    -1.5     -2.8

              1        12.0    29.5    31.0     37.6     110.1
              2        25.7    38.5    29.0     48.2     141.4
X1 + X2       3         8.0    18.0    29.5     36.3      91.8
              4         3.5    28.0    17.5     31.4      80.4
              5        57.0    35.5    22.5     46.8     161.8
            Total     106.2   149.5   129.5    200.3     585.5

In outlining the form of the analysis for these data, the first logical 
division is into 


DF 
Between pairs 19 
Within pairs 20 


The first of these may be calculated from the table of $(X_1 + X_2)$ and the
second from the table of $(X_1 - X_2)$ as follows.

$$\text{Between pairs} = \tfrac{1}{2}\sum (X_1 + X_2)^2 - \frac{(\sum X)^2}{N}$$

where N is 40, the total number of observations.

$$\text{Within pairs} = \tfrac{1}{2}\sum (X_1 - X_2)^2$$

This will be the error term of the analysis.


The sum of squares and degrees of freedom for between pairs now have
to be divided into


                                        DF
Stations                                 3
Varieties                                4
Interaction (varieties x stations)      12

The calculations are

$$\begin{aligned}
\text{Total} &= 10{,}329.73 - \frac{585.5^2}{40} = 10{,}329.73 - 8570.26 = 1759.47 \\
\text{Error} &= \tfrac{1}{2}(93.33) = 46.67 \\
\text{Between pairs} &= \tfrac{1}{2}\sum (X_1 + X_2)^2 - \frac{585.5^2}{40} = 10{,}283.06 - 8570.26 = 1712.80 \\
\text{Stations} &= \frac{106.2^2 + 149.5^2 + 129.5^2 + 200.3^2}{10} - 8570.26 = 481.64 \\
\text{Varieties} &= \frac{73{,}186.61}{8} - 8570.26 = 578.07 \\
\text{Interaction} &= 1712.80 - 481.64 - 578.07 = 653.09
\end{aligned}$$

The analysis of variance is as follows.


                                         SS       DF      MS
Stations                               481.64      3    160.5
Varieties                              578.07      4    144.5
Interaction (varieties x stations)     653.09     12     54.42
Error                                   46.67     20      2.334
Total                                 1759.47     39


Seeing that the four component sums of squares add to the total as 
calculated provides a good check on the work. 
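
The whole calculation for Example 5-2 can be retraced from the duplicate loaf volumes. The sketch below is illustrative rather than the author's procedure in code form; the array layout `loaves[variety, station, duplicate]` and the variable names are assumptions made for the example.

```python
import numpy as np

# loaves[variety, station, duplicate], values as in Table 5-2.
loaves = np.array([
    [[7.5, 4.5], [15.5, 14.0], [16.5, 14.5], [19.0, 18.6]],
    [[12.5, 13.2], [20.0, 18.5], [15.0, 14.0], [23.8, 24.4]],
    [[7.0, 1.0], [10.0, 8.0], [15.5, 14.0], [17.8, 18.5]],
    [[1.5, 2.0], [13.0, 15.0], [8.5, 9.0], [14.8, 16.6]],
    [[28.0, 29.0], [19.5, 16.0], [10.5, 12.0], [22.0, 24.8]],
])

correction = loaves.sum() ** 2 / loaves.size            # (sum X)^2 / N
pair_sums = loaves.sum(axis=2)                          # X1 + X2
pair_diffs = loaves[:, :, 0] - loaves[:, :, 1]          # X1 - X2

total_ss = (loaves ** 2).sum() - correction
between_pairs = (pair_sums ** 2).sum() / 2 - correction
error_ss = (pair_diffs ** 2).sum() / 2                  # within pairs

# Station totals rest on 10 loaves each, variety totals on 8.
stations_ss = (pair_sums.sum(axis=0) ** 2).sum() / 10 - correction
varieties_ss = (pair_sums.sum(axis=1) ** 2).sum() / 8 - correction
interaction_ss = between_pairs - stations_ss - varieties_ss

# Varieties tested against the interaction mean square (4 and 12 DF).
f_varieties = (varieties_ss / 4) / (interaction_ss / 12)
```

The four components `stations_ss`, `varieties_ss`, `interaction_ss`, and `error_ss` add to `total_ss`, which is the check mentioned above.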

It now has to be decided whether to use the mean square from the 
duplicate loaf volumes or the interaction mean square to test the signifi- 
cance of the differences between stations and varieties. If the purpose of 
the experiment is to determine which of the varieties will give the highest 
loaf volume over the whole area that the stations sample, it will be neces- 
sary to make the comparison of the variety mean square with the inter- 
action mean square because in this light the stations are merely replications 
of the experiment. The error from duplicate loaf volumes will give an 
indication merely of the accuracy of the laboratory technique. If it is 
large it will reduce the significance of the differences because it raises the 
value of the interaction variance. 

On comparing the variety mean square with the interaction mean
square, we get an F value of 2.66; and, since the 5% point is 3.26, it must
be concluded that, considering the whole area being sampled, the differ- 
ences in loaf volume are not significant. In other words, the variation 
in the order of the mean loaf volumes of the varieties, from station to 
station, is so great that the differences between the means for the whole 
area may easily be accounted for by this variation. 

The interaction mean square is very much higher than that arising from 
differences between duplicate loaf volumes. This means that the labora- 
tory error is not an appreciable factor affecting the precision of the results 
in this experiment. 

Since variety tests are conducted in replicated plots at each station, it 
follows that, if loaf-volume determinations had been made on each plot, 
another measure of error could have been obtained. This error would 
have measured the variation due to soil heterogeneity; and, if the variety 
mean square for the whole area was significant when compared to the 
pooled error due to soil heterogeneity, this would indicate that in general 
at each station the differences between the means of the varieties were 
greater than could be accounted for by such sampling variation. This 
would not, however, alter our conclusion based on the test in which the 
interaction is the error.

14. Example 5-3. Threefold classification of variates. In testing out
a machine for molding the dough in experimental baking, Geddes et al.
[17] employed 3 adjustments of the machine, designated A, B, and C, and
tried them out on a series of 5 flours baked according to 2 formulas. The
loaf-volume data are given in Table 5-3.


TABLE 5-3


LOAF-VOLUME RESULTS IN A TEST OF A MACHINE FOR MOLDING
THE DOUGH (LOAF VOLUME IN cc. - 500)/10


                                          Flour
Formula     Machine Setting      1       2       3       4       5      Total

               A                9.4     2.6    12.3     4.6    13.5     42.4
Simple         B                9.6     3.1    13.0     4.3    13.8     43.8
               C                9.6     2.7    12.4     1.8    13.0     39.5

            Flour sub-total    28.6     8.4    37.7    10.7    40.3    125.7

               A               13.7    21.6    19.4    13.5    24.5     92.7
Bromate        B               12.7    22.6    20.6    10.4    24.3     90.6
               C               12.6    21.8    20.9     6.8    23.2     85.3

            Flour sub-total    39.0    66.0    60.9    30.7    72.0    268.6

            Flour total        67.6    74.4    98.6    41.4   112.3    394.3


When the form of the analysis is worked out and compared to those 
that have been worked out previously, we find an additional complication. 
The 6 rows in Table 5-3 represent 2 classifications, but for the present we 
shall consider them as 6 classes, giving us a simple twofold classification. 
The form of the analysis is then 


DF 

Flours 4 
Classes 5 
Interaction (flours x classes) 20 
Total 29 


But the 5 degrees of freedom for classes must be split up into


                                        DF
Machine settings (ABC)                   2
Formulas (SB)                            1
Interaction (ABC x SB)                   2

Hence the interaction in the first analysis is an interaction of the above
3 factors with the flours. Realizing this, we can then write out the form
of the analysis in full.


                                        DF
Flours (1 ... 5)                         4
Machine settings (ABC)                   2
Formulas (SB)                            1
Interaction (ABC x SB)                   2
Interaction (1 ... 5 x ABC)              8
Interaction (1 ... 5 x SB)               4
Interaction (1 ... 5 x ABC x SB)         8
Total                                   29


The last interaction is known as a 3-factor interaction. In this example
it represents the degree to which the interaction of (ABC x SB) is different
for the different flours. If the interaction (ABC x SB) is the same for
each flour, the 3-factor interaction will be zero.

To determine the sums of squares for the components set out above it
is necessary to set up three calculation tables as below (the S - B
differences are entered without regard to sign, since only their squares
are required):


Machine                        Flour
Setting        1       2       3       4       5      Total
   A         23.1    24.2    31.7    18.1    38.0     135.1
   B         22.3    25.7    33.6    14.7    38.1     134.4
   C         22.2    24.5    33.3     8.6    36.2     124.8
 Total       67.6    74.4    98.6    41.4   112.3     394.3


                               Flour
Formula        1       2       3       4       5      Total
   S         28.6     8.4    37.7    10.7    40.3     125.7
   B         39.0    66.0    60.9    30.7    72.0     268.6
 S + B       67.6    74.4    98.6    41.4   112.3     394.3
 S - B       10.4    57.6    23.2    20.0    31.7     142.9


                   Machine Setting
Formula        A       B       C      Total
   S         42.4    43.8    39.5     125.7
   B         92.7    90.6    85.3     268.6
 S + B      135.1   134.4   124.8     394.3
 S - B       50.3    46.8    45.8     142.9


The calculations are

Total                             = 6618.43 - (394.3²/30) = 6618.43 - 5182.42 = 1436.01
Flours (1 ... 5)                  = (34,152.33/6) - 5182.42 = 509.64
Settings (ABC)                    = (51,890.41/10) - 5182.42 = 6.62
Formulas (SB)                     = (268.6 - 125.7)²/30 = 680.68
Interaction (ABC x SB)            = [Σ(S - B)²/10] - 680.68 = (6817.97/10) - 680.68 = 1.12
Interaction (1 ... 5 x ABC)
    Total for table               = (11,436.57/2) - 5182.42 = 535.86
    Flours (1 ... 5)              = 509.64
    Settings (ABC)                = 6.62
    Remainder (1 ... 5 x ABC)     = 19.60
Interaction (1 ... 5 x SB)        = [Σ(S - B)²/6] - 680.68 = (5369.05/6) - 680.68 = 214.16
Interaction (1 ... 5 x ABC x SB)  = Remainder = 4.19

The analysis of variance when set up in detail is as follows.

                                       SS       DF      MS         F      5% Point
Flours (1 ... 5)                     509.64      4    127.4      243.1      3.84
Formulas (SB)                        680.68      1    680.7     1299.0      5.32
Interaction (1 ... 5 x SB)           214.16      4     53.54     102.2      3.84

Settings (ABC)                         6.62      2      3.31       6.31     4.46
Interaction (ABC x SB)                 1.12      2      0.560      1.07     4.46
Interaction (1 ... 5 x ABC)           19.60      8      2.450      4.68     3.44
Interaction (1 ... 5 x ABC x SB)       4.19      8      0.524
Total                               1436.01     29
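
The threefold analysis can be reproduced from the figures of Table 5-3. The following is a sketch under stated assumptions: the array layout `volume[formula, setting, flour]` and the helper `ss_of_totals` are introduced here for illustration and are not part of the original text. Each sum of squares is obtained from the totals of the appropriate table, and the interactions are found by subtraction as in the calculations above.

```python
import numpy as np

# volume[formula, setting, flour]; formula 0 = simple, 1 = bromate.
volume = np.array([
    [[9.4, 2.6, 12.3, 4.6, 13.5],
     [9.6, 3.1, 13.0, 4.3, 13.8],
     [9.6, 2.7, 12.4, 1.8, 13.0]],
    [[13.7, 21.6, 19.4, 13.5, 24.5],
     [12.7, 22.6, 20.6, 10.4, 24.3],
     [12.6, 21.8, 20.9, 6.8, 23.2]],
])

correction = volume.sum() ** 2 / volume.size

def ss_of_totals(axes_to_sum):
    """SS among the totals that remain after summing over the given axes."""
    totals = volume.sum(axis=axes_to_sum)
    per_total = volume.size / totals.size   # determinations in each total
    return (totals ** 2).sum() / per_total - correction

total_ss = (volume ** 2).sum() - correction
formulas = ss_of_totals((1, 2))
settings = ss_of_totals((0, 2))
flours = ss_of_totals((0, 1))
formula_x_setting = ss_of_totals((2,)) - formulas - settings
flour_x_setting = ss_of_totals((0,)) - flours - settings
flour_x_formula = ss_of_totals((1,)) - flours - formulas
three_factor = total_ss - (formulas + settings + flours + formula_x_setting
                           + flour_x_setting + flour_x_formula)

# Settings tested against each of the two candidate errors.
f_settings_vs_three_factor = (settings / 2) / (three_factor / 8)
f_settings_vs_flour_x_setting = (settings / 2) / (flour_x_setting / 8)
```

The two F ratios at the end correspond to the two choices of error for the settings: the 3-factor interaction, and the flours x settings interaction.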


It is of interest to make a detailed study of Example 5-3 from the 
standpoint of the selection of a valid error. Note first that the determin- 
ations were not made in duplicate so that there is no real measure of the 
error in the technique. In the second place it must be remembered that 
the primary object of the experiment is to study the differences in the 
loaf volumes due to the different settings of the machine and the differ- 
ential responses due to these same settings. For this reason the analysis 
of variance has been separated into two portions. The three effects in 
the first group are of no particular interest, as previous experience would 
have enabled the cereal chemists to predict that just such results would 
be obtained. The separation of these three effects into one group is not 
a result of the data obtained in the experiment but was preconceived, and 
it was decided before the experiment was carried out that this would be done.

Considering the variance due to the settings, the first question to be
asked is whether or not it should be tested against a variance representing
pure laboratory error or against one of the 2-factor interactions. It
should not be tested against (ABC x SB) because in actual practice either
the simple or the bromate formula will be used. Actually there are two
questions to be answered here, and there are answers appropriate to both.
If the effect of the settings is considered in relation to results with a series
of different flours, the correct error would be the (flours x settings)
interaction. The result would indicate no significance for the effect of
settings. If the effect of settings is to be judged on a series of loaves
from the same flour, the correct error would be the 3-factor interaction
in that it most nearly represents laboratory error.

The F values and their 5% points are given in the analysis, and with 
their aid the results may be summarized very quickly. The flour and 
formula differences as well as the interaction between them are very large 
in comparison to the experimental error and may be dismissed with that 
statement. The primary interest in the experiment is in the settings of
the machine and the interaction of the settings with the other factors.
The settings are significant in relation to experimental error, and a glance 
at the totals indicates that this must be due to the fact that the C setting 
gives a somewhat lower loaf volume than A or B. The interaction of 
ABC with the formulas S and B is not significant, indicating that the 
differences between the settings are reasonably consistent for both methods 
of baking. The interaction of the flours with the settings is significant, 
and we can conclude that the results with the flours are to a certain extent 
changed by the machine settings. From an inspection of the results this 
would seem to be due to flour 4, as the B and C settings depress the loaf
volume for this flour to a greater extent than for the others. 

15. Limitations of the analysis of variance. Cochran [7] has stated that 
“the analysis of variance depends on the assumptions that the treatment 
and environment effects are additive and that the experimental errors are 
independent in the probability sense, have equal variance, and are normally 
distributed." It will be recalled that in Section 2 an individual result was
represented by the symbols X + Y + Z, where X is the part due to
random variation, Y the part due to replicate, and Z the part due to
treatment. It is necessary only that the X be independently and normally
distributed with a mean of zero and variance σ² in order for the F test to
hold. A correlation between experimental errors may exist under certain
conditions. Actually in field experiments the errors of adjacent plots are
distinctly correlated in that they tend to vary together. This difficulty is
very largely overcome in field experiments by a process of randomization
which is illustrated in later chapters. Inequalities of variance arise
frequently. In Chapter 11 one experiment is described in which this
occurs, and methods are illustrated for obtaining more accurate tests of
significance than in the straightforward application of the analysis of
variance. Lack of normality in data also occurs frequently, and a brief
outline is given below on methods of transforming data to some other
scale such that a closer approach to normality is obtained.

16. Transformations. These are described below with reference to 
common types of data requiring transformation. 

a. Small whole numbers. Suppose that we are making counts of the 
number of insects per square foot of soil where the concentration of 
insects is quite low. The count for a number of individual square feet 
may be zero, and the maximum may be around 10. The most appropriate
transformed variate is $\sqrt{X}$, where X is the actual count. When zeros
occur, it is advisable to take $\sqrt{\tfrac{1}{2} + X}$ as the transformed variate.

b. Fractions or percentages. Percentages are particularly likely to show 
deviations from the normal distribution. As a matter of fact, it will be 
clear from Chapter 3 that the distribution expected is the binomial. When 
p ≠ q, it is obvious that the distribution will be skewed. When p is equal
or nearly equal to q, the distribution is sufficiently symmetrical to be 
treated as normal. A good working rule is to transform percentages 
unless nearly all the data lie between 30 and 70%. The appropriate 
transformation is to the angle for which the sine is the square root of the 
percentage to be transformed. This is known as the inverse sine trans- 
formation for which tables are given by Fisher and Yates [16], Bliss [3, 4], 
and Snedecor [23]. When a series of experiments are analyzed singly 
and then thrown together in one analysis, it is obvious that the transfor- 
mation must be made throughout, regardless of the range. 

c. Whole number counts with a wide range. Referring again to insect 
counts in soil, if the concentration is high, counts in certain squares may 
be very high. Imagine a soil treatment for insect control where with an 
effective treatment the counts range from 5 to 20, whereas with an
ineffective treatment they may range from 100 to 10,000. In such cases
there is usually a proportional relation between the mean and the standard 
deviation. The appropriate transformation is to the logarithm of the 
count. Here, again, when zeros are encountered, it is advisable to add 1 
to each number before taking the logarithm. 
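
The three transformations can be written as small functions. This is a minimal sketch with hypothetical function names. Two points are assumptions rather than statements from the text: the angular transformation is returned in degrees, and common (base 10) logarithms are used, since the text does not name the base.

```python
import math

def sqrt_transform(count, has_zeros=False):
    """Square root of a small whole-number count; adds 1/2 when zeros occur."""
    return math.sqrt(count + 0.5) if has_zeros else math.sqrt(count)

def angular_transform(percentage):
    """Angle (in degrees) whose sine is the square root of the proportion."""
    return math.degrees(math.asin(math.sqrt(percentage / 100.0)))

def log_transform(count, has_zeros=False):
    """Logarithm of a wide-ranging count; adds 1 when zeros occur."""
    return math.log10(count + 1) if has_zeros else math.log10(count)
```

Whichever transformation is chosen, the same function should be applied to every observation in the analysis, including those for which the adjustment for zeros makes little difference.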

Generally speaking, in examining data it is worth while to watch for 
skew distributions, unequal variances, and other abnormalities with a view 
to making appropriate corrections or transformations before proceeding 
with the analysis. Further details on this subject can be obtained in 
Biometrics of March, 1947, in which a whole issue by Churchill Eisenhart,
W. G. Cochran, and M. S. Bartlett is devoted to explaining the theoretical
requirements of the analysis of variance, the consequences of abnormality
in data, and appropriate transformations.


17. Exercises. 


1. Table 5-4, taken from data by Crampton and Hopkins [9], shows the gains in 
weight of pigs in a comparative feeding trial. The 5 lots of pigs represent 5 different 
treatments, and there were 10 pigs in each lot. Make an analysis of variance for the 
data, and test the significance of the treatment differences. 


TABLE 5-4 


GAINS OF PIGS IN A COMPARATIVE FEEDING TRIAL 


Replicate    Lot I    Lot II    Lot III    Lot IV    Lot V
    1         165      168       164        185       201
    2         156      180       156        195       189
    3         159      180       189        186       173
    4         167      166       138        201       193
    5         170      170       153        165       164
    6         146      161       190        175       160
    7         130      171       160        187       200
    8         151      169       172        177       142
    9         164      179       142        166       184
   10         158      191       155        165       149


The error mean square in this experiment works out to 243.6. 


2. In agronomic trials of varieties of cereal crops it is desirable to conduct the trials 
at various points in the area under consideration and to carry them on for a period 
of 2 or more years. Immer et al. [20] have given data on barley yields at several 
stations in Minnesota over a period of 2 years. Table 5-5 shows the yields at 3
stations for 2 years for 6 varieties. Analyze the results. 

Note that the blocks are numbered 1, 2, and 3, but this does not mean that block 1 
at University Farm has any relation to block 1 at Waseca or any other station. Conse- 
quently the sum of squares and degrees of freedom for blocks are worked out at each 
station and lumped together in the final analysis. A common error that beginners 
make in sorting out the degrees of freedom for an experiment of this kind is to regard 
the blocks as a factor occurring at 3 levels, and thus they have such expressions in their 
analysis as these: 

Blocks x Stations
Blocks x Years
Blocks x Stations x Years
etc.

These expressions obviously have no meaning as the block numbers do not represent
definite levels that are uniform at all stations. The correct procedure is therefore to
calculate the block sum of squares for each experiment and add all these sums of
squares together in order to show them in the final analysis.


TABLE 5-5 


YIELDS IN BUSHELS PER ACRE OF 6 VARIETIES OF BARLEY GROWN 
AT 3 STATIONS IN EACH OF 2 YEARS 


Block  Manchuria  Glabron  Svansota  Velvet  Trebi  Peatland   Station           Year

  1       29.2      44.6     33.9     36.7    41.2    38.5
  2       25.0      39.1     39.4     41.0    31.9    29.6     University Farm   1931
  3       26.8      45.5     32.1     42.0    36.6    30.2

  1       19.7      28.6     20.1     20.3    19.3    22.3
  2       31.4      38.3     30.8     27.5    22.4    30.8     University Farm   1932
  3       29.6      43.5     31.4     32.6    45.5    31.1

  1       47.5      55.4     44.5     56.9    63.9    41.2
  2       52.2      53.4     46.0     40.6    63.8    51.5     Waseca            1931
  3       46.9      56.8     51.5     53.2    63.8    53.0

  1       40.8      44.4     41.0     44.6    53.5    39.8
  2       29.4      34.9     41.1     41.4    44.2    39.2     Waseca            1932
  3       30.2      33.9     33.4     26.2    50.0    29.1

  1       24.0      27.5     26.5     27.2    42.1    24.7
  2       24.7      25.5     21.5     28.0    42.5    29.5     Morris            1931
  3       33.6      33.3     29.3     23.2    46.7    35.4

  1       29.6      36.6     27.1     35.9    40.0    35.7
  2       34.1      34.3     35.7     33.9    46.9    41.9     Morris            1932
  3       39.4      34.5     42.3     46.7    53.0    52.0


The following values for the sums of squares will assist in checking the calculations: 


Total 11,504.61 
Varieties 1,566.58 
Varieties x Stations x Years 230.52 


3. Prove that $\sum_{1}^{n} (X - \bar{X})^2 = \sum_{1}^{n} (X^2) - [\sum (X)]^2/n$.


Find the 5% points of F for the following values of $n_1$ and $n_2$.


n1     n2        n1     n2
 3     51        16     39
 6     43        18    215
 4     92        17     19
12    195        36     28
 7     36        28    154
11     64        53     42

4. Table 5-6 gives a portion of the data given by Jones [22] on wire-worm counts.
Convert the data to $\sqrt{\tfrac{1}{2} + X}$ and perform an analysis of variance. Calculate means
for the treatments and reconvert to counts. Use square roots accurate to two decimal
places.

TABLE 5-6


WIRE-WORM COUNTS FOR 5 TREATMENTS IN
AN EXPERIMENT WITH 3 REPLICATES


Replicate 1 2 M r d 
Replicate 2 М T S M F 
Replicate 3 A М * A d 
Treatment mean square = 1.0133.


CHAPTER 6


Linear Regression Analysis


1. General observations. In the previous discussions emphasis was 
placed on the variations that occur in any one variable, such as the yield 
of plots of wheat, the weight of animals, the count of diseased plants per 
plot, or the volume of loaves of bread. The present chapter deals chiefly 
with the joint variation of two variables, and particularly with the situation 
where the variation of one variable is sensibly dependent on variation in 
the second variable. Thus one variable is said to be dependent and the 
other one independent. We might have the two variables, rainfall and 
yield. It is well known that rainfall affects yields, so rainfall would be 
considered the independent variable, and yield, the dependent. 

In Chapter 5 we had some contact with problems of this type. For 
example, the yields of plots on which different levels of fertilizer have 
been applied actually present a regression problem, where yield is the 
dependent variable, and amount of fertilizer, the independent. In an 
actual trial we might have results somewhat as follows. 


                                                                    Mean
Quantity of fertilizer, pounds per acre     100    200    300    400    250
Yield of wheat, bushels per acre             19     24     27     30     25


The results can be graphed as in Figure 6-1. 

The first point that strikes us about this result is the fairly uniform 
increase in yield with uniform increases in amount of fertilizer. This is a 
practical way of describing the effect in terms of yield and fertilizer. In 
statistical language, however, we would say that the response of wheat 
yield to quantity of fertilizer appears to be linear, i.e., can be represented 
by a straight line. The next thought is that if the relation between the 
two variables is linear it can be represented by an equation of the form 
Y = a + bX, where Y is yield and X is quantity of fertilizer. The term
a represents the origin of the line or the value of Y for X = 0, and b
represents the slope of the line. Thus b is the increase in Y for a unit 
of increase in X. 

2. Derivation of the regression equation. The situation outlined in 
Section 1 immediately poses the question of how to set up the required
equation. There is a good deal of elegant theory involved in this, and
for an excellent exposition of the whole subject the student is referred to
Chapter 6 of Statistical Methods by G. W. Snedecor [5]. Here we shall
be satisfied with the statement that the most suitable equation for the
average problem is the one for which the sum of squares of the deviations
from the straight line is a minimum. We shall take the regression line to
be fitted as $Y_e = a + bX$, where the subscript indicates that Y is an
estimated quantity. There is a definite reason for selecting this particular
straight-line equation. In the first place, by putting in the constant a, the
position of the line on the Y axis is allowed to take any value required
and does not have to pass through any particular point on the Y axis.
Thus, if X = 0, $Y_e = a$. If we had put $Y_e = bX$, then the line would
have been forced to pass through the origin of Y and X. The values of
the constants a and b are to be such that the sum of squares of the errors
of estimation will be a minimum. Figure 6-2 illustrates the situation
for one pair of variates, Y and X. We see that $y = y_e + y_d$, where $y_e$
represents the deviation of $Y_e$ from $\bar{Y}$, and $y_d$ the deviation of Y from the
regression line. The regression line to be fitted is $Y_e = a + bX$, but it
is quite permissible to transform Y and X to deviations from their respective
means, whence the regression equation becomes
$Y_e - \bar{Y} = a' + b'(X - \bar{X})$, or $y_e = a' + b'x$, using the prime notation for a and b because
we do not know at this stage if they will be the same as in $Y_e = a + bX$.
The sum of squares to be minimized = $\sum (Y - Y_e)^2 = \sum (y_d^2)$. From
Figure 6-2, $y_d = (y - y_e)$, and therefore, on substituting for $y_e$, we have
$y_d = (y - a' - b'x)$.

FIGURE 6-1. Graph showing relation between yield and quantity
of fertilizer.

The sum of squares to be minimized is $\sum (y - a' - b'x)^2 = f$. Differentiating
with respect to a', we have*

$$\frac{\partial f}{\partial a'} = -2\sum (y - a' - b'x) = -2\sum y + 2na' + 2b'\sum x$$

and then with respect to b',

$$\frac{\partial f}{\partial b'} = -2\sum x(y - a' - b'x) = -2\sum xy + 2a'\sum x + 2b'\sum x^2$$

where n represents the number of pairs of variates. Putting both derivatives
equal to zero, we get the two equations

$$\sum y = a'n + b'\sum x \tag{1}$$

$$\sum xy = a'\sum x + b'\sum x^2 \tag{2}$$

These are known as the normal equations, which we have derived here by
the method of least squares. Then, since $\sum y$ and $\sum x$ are equal to zero,
a' = 0, and

$$b' = \frac{\sum xy}{\sum x^2} \tag{3}$$

FIGURE 6-2. Illustrating deviations from regression and deviations from
the general mean of Y, for one pair of variates of X and Y.



* Students not familiar with differential calculus must take it for granted that equa- 
tions (1) and (2) do result from minimizing the sum of squares for deviations from the 
regression line.

The statistic b' is commonly referred to as the regression coefficient. The
regression equation is $y_e = b'x$, and, if we substitute $Y_e - \bar{Y}$ for $y_e$, and
$X - \bar{X}$ for x,

$$Y_e - \bar{Y} = b'(X - \bar{X})$$

and

$$Y_e = (\bar{Y} - b'\bar{X}) + b'X \tag{4}$$

which can be put in the general form $Y_e = a + bX$, where $a = \bar{Y} - b'\bar{X}$
and $b = b'$.

It will be of interest to calculate the regression equation for the example
of wheat yields and fertilizer, but we shall first perform an operation on
the values of X that will reduce the work. This is merely to divide each
X by 100, giving weights of fertilizer in hundreds of pounds instead of
pounds. This is known as coding the data. Then, setting out the results
in terms of deviations from the means gives us


x    -1.5    -0.5    0.5    1.5
y    -6      -1      2      5

Then
$$\sum x^2 = (-1.5)^2 + (-0.5)^2 + (0.5)^2 + (1.5)^2 = 5.0$$
and
$$\sum xy = (-1.5)(-6) + (-0.5)(-1) + (0.5 \times 2) + (1.5 \times 5) = 18.0$$
giving
$$b = \frac{18.0}{5.0} = 3.6$$

This tells us that, according to the regression equation, the expected
increase in yield for 100 pounds of fertilizer is 3.6 bushels. The actual
equation is

$$Y_e = 25 + 3.6(X - 2.5)$$
or
$$Y_e = 25 - (3.6 \times 2.5) + 3.6X = 16.0 + 3.6X$$

In order to draw the line on the graph only two points are required.
Taking
$$16.0 + 3.6 \times 1 = 19.6$$
$$16.0 + 3.6 \times 4 = 30.4$$
two points are obtained from which the line can be placed on the graph
in the correct position. Note that the regression equation passes through
$\bar{X}$, $\bar{Y}$, providing a check on the calculations.
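
The fertilizer calculation can be repeated in a few lines of Python. This is only a sketch: the yields 19, 24, 27, 30 are rebuilt from the printed deviations and the mean yield of 25, the coded X values 1 to 4 are inferred from the deviations and the mean of 2.5, and the function name `fit_line` is our own.

```python
# Fertilizer example, with X coded in hundreds of pounds (assumed 1, 2, 3, 4)
# and yields rebuilt from the printed deviations about the mean of 25.
X = [1.0, 2.0, 3.0, 4.0]
Y = [19.0, 24.0, 27.0, 30.0]

def fit_line(xs, ys):
    """Return (a, b) for Y_e = a + bX fitted by least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    b = sum_xy / sum_xx          # equation (3)
    a = y_bar - b * x_bar        # from equation (4)
    return a, b

a, b = fit_line(X, Y)
print(f"Y_e = {a:.1f} + {b:.1f}X")   # prints: Y_e = 16.0 + 3.6X
```

Evaluating the fitted line at the mean of X returns the mean of Y, which is the check mentioned above.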
3. Goodness of fit of regression line. Reference to Figure 6-3 will show
that $y_i$, the deviation of an individual variate $Y_i$ from its mean $\bar{Y}$, is made
up of two parts, $y_{di}$ and $y_{ei}$. Thus

$$y_i = y_{di} + y_{ei}$$
and
$$\sum y^2 = \sum (y_d + y_e)^2 = \sum y_d^2 + \sum y_e^2 + 2\sum (y_d y_e) \qquad (5)$$


FIGURE 6-3. Illustrating the origin of sums of squares for deviations from 
regression, and deviations of estimated value from the general mean.


From the regression equation $Y_e = \bar{Y} + bx$ we get $Y_e - \bar{Y} = bx$ or
$y_e = bx$. From the graph we know that $y_d = y - y_e = y - bx$.
Therefore

$$\sum (y_d y_e) = \sum (y - bx)bx = b\sum xy - b^2\sum x^2$$
From equation (3) it is seen that we can substitute $b\sum x^2$ for $\sum xy$, and it
is then obvious that

$$\sum (y_d y_e) = 0 \qquad (6)$$
Thus the third term on the right in (5) vanishes, and we have
$$\sum y^2 = \sum y_d^2 + \sum y_e^2 \qquad (7)$$


which is a very important generalization. It shows that the total sum of
squares of Y can be partitioned into one part representing regression and 
a second part representing deviations from regression. 

The next step is to decide on the degrees of freedom to be associated
with each sum of squares. The clue is given by the number of statistics
calculated from the sample in order to obtain $\sum y_e^2$, which in turn is
wholly dependent on the regression equation, $y_e = bx$. The regression
coefficient is the one additional statistic required to define this equation;
therefore $\sum y_e^2$ is represented by 1 degree of freedom. This gives us the
basis for setting up an analysis of variance as follows.

                               SS              DF       MS
Total                          $\sum y^2$      n - 1
Regression                     $\sum y_e^2$    1        $V_e$
Deviations from regression     $\sum y_d^2$    n - 2    $V_d$

A test of significance is then given by calculating $F = V_e/V_d$ and noting
the 5% point of $F$ for 1 and $n - 2$ degrees of freedom. This amounts to
testing whether or not $b$ differs significantly from zero.
The significance of $b$ can be tested directly by means of the $t$ test. Let

$$s_e = \sqrt{\frac{\sum y_d^2}{n - 2}} \qquad (8)$$
where $s_e$ is referred to as the standard error of estimate. Then the
standard error of the regression coefficient is

$$s_b = \frac{s_e}{\sqrt{\sum x^2}} = \sqrt{\frac{\sum y_d^2}{(n - 2)\sum x^2}} \qquad (9)$$
and
$$t = \frac{b}{s_b} \qquad (10)$$

It is easily verified that the two tests of significance are identical since
$F = t^2$ when the mean square tested is estimated from 1 degree of freedom.
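
The identity $F = t^2$ can be checked numerically on the fertilizer example. The sketch below uses only the deviations printed in Section 2; the variable names are our own.

```python
import math

# Deviations from the means for the fertilizer example of Section 2.
x = [-1.5, -0.5, 0.5, 1.5]
y = [-6.0, -1.0, 2.0, 5.0]
n = len(x)

sum_xx = sum(v * v for v in x)                 # sum of squares of x
sum_xy = sum(u * v for u, v in zip(x, y))      # sum of products
sum_yy = sum(v * v for v in y)                 # total sum of squares, n - 1 DF

b = sum_xy / sum_xx
ss_regression = sum_xy ** 2 / sum_xx           # sum of y_e^2, 1 DF
ss_deviations = sum_yy - ss_regression         # sum of y_d^2, n - 2 DF

ms_deviations = ss_deviations / (n - 2)
F = ss_regression / ms_deviations

s_e = math.sqrt(ms_deviations)                 # equation (8)
s_b = s_e / math.sqrt(sum_xx)                  # equation (9)
t = b / s_b                                    # equation (10)

print(f"SS regression = {ss_regression:.1f}, SS deviations = {ss_deviations:.1f}")
print(f"F = {F:.2f}, t = {t:.3f}, t squared = {t * t:.2f}")
```

With only 2 degrees of freedom for error this is a check on the algebra rather than a serious test of significance.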
4. Fiducial limits of the regression coefficient. The general expression
for $t$ in testing a regression coefficient is
$$t = \frac{b - \beta}{s_b} \qquad (11)$$
where $\beta$ is the regression coefficient of any hypothetical population that
we choose to take. In other words, we can take $\beta = 0$ in the ordinary
test of significance which involves the hypothesis that $b$ has arisen by
random sampling from a population in which $\beta = 0$. This allows us to
set up fiducial limits for $b$. Thus

$$\beta = b \pm t s_b \qquad (12)$$
and to obtain the fiducial limits at the 5% point we take
$$l_1 = b - t_{.05} s_b \qquad l_2 = b + t_{.05} s_b \qquad (13)$$

where $t_{.05}$ is the value of $t$ given in the tables for $p = 0.05$ and degrees
of freedom equal to $n - 2$.

5. Standard error of estimated Y. The term standard error of estimate
should not lead one to think that it is to be applied to the accuracy of an
estimate of Y. The standard error of estimate applies specifically, as we
have seen in applying the $t$ test, to variation in the regression coefficient.
It may be necessary, however, to make a test of significance or set up
fiducial limits for $Y_e$. From the regression equation $Y_e = \bar{Y} + bx$ it can
be seen that the mean square of $Y_e$ depends upon variation in $\bar{Y}$ and $bx$
for which the separate mean squares are $s_e^2/n$ and $(s_e^2/\sum x^2)x^2$, and these
must be added to obtain the mean square for $Y_e$, giving

$$s_{Y_e}^2 = \frac{s_e^2}{n} + \frac{s_e^2 x^2}{\sum x^2} \qquad (14)$$
For purposes of calculation it is convenient to put
$$s_{Y_e}^2 = s_e^2\left(\frac{1}{n} + \frac{x^2}{\sum x^2}\right) \qquad (15)$$

From this equation it is clear that $s_{Y_e}^2$ is a minimum when $x = (X - \bar{X})$
$= 0$, that is, when the value of X for which $Y_e$ is being predicted is equal
to its mean. As X varies from its mean, the mean square of $Y_e$ increases.
The regression line represents estimates of values of Y for given values 
of X in the population sampled, and the standard error as given in (14) 
may be used to set up fiducial limits on both sides of the regression line. 
It is quite another problem to select a given X and use the regression 
equation to predict the corresponding Y. For example, a regression 
problem may involve studying the regression of weight on age for pigs. 
For pigs of a given age X, we may predict by means of the regression 
equation that their average weight would be Y,. This is a problem of 
estimation for the population, and to set up fiducial limits the standard 
error required would be that of equation (14). Now if we wish to select
an individual pig of age X, and make an estimate of its weight, the answer
would again be $Y_e$, but the error of such an estimate is greater and is
given by
$$s_Y^2 = s_e^2\left(1 + \frac{1}{n} + \frac{x^2}{\sum x^2}\right) \qquad (16)$$
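
The difference between equations (15) and (16) is easy to see side by side. The following sketch is illustrative only: the function names and the input values ($s_e^2 = 4$, $n = 20$, $\sum x^2 = 50$) are hypothetical and are not taken from any example in this chapter.

```python
import math

def se_mean_estimate(s_e2, n, x, sum_xx):
    """Equation (15): standard error of Y_e as an estimate for the population."""
    return math.sqrt(s_e2 * (1.0 / n + x * x / sum_xx))

def se_individual(s_e2, n, x, sum_xx):
    """Equation (16): standard error when estimating one individual's Y."""
    return math.sqrt(s_e2 * (1.0 + 1.0 / n + x * x / sum_xx))

# Hypothetical inputs, chosen only for illustration.
s_e2, n, sum_xx = 4.0, 20, 50.0
for x in (0.0, 2.0, 5.0):        # x is the deviation of X from its mean
    print(x,
          round(se_mean_estimate(s_e2, n, x, sum_xx), 3),
          round(se_individual(s_e2, n, x, sum_xx), 3))
```

Both errors grow as X departs from its mean, and the error for an individual is always the larger of the two because of the additional term $s_e^2$.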
6. Linear regression when both variables are subject to error. In the 
example described above it was assumed that the data were taken from 
an experiment involving different levels of application of a fertilizer. 
Probably the more usual case is to have a series of pairs of values of the 
two variates X and Y, and it is required to determine the regression line 
for predicting Y from X. For this situation it should be noted that the 
assumption is that X can be measured without error. Suppose that X is
percentage disease infection and Y is yield and the data are obtained from
a series of plots. It is assumed that the percentage infection on each 
plot can be determined accurately, and generally this is close enough to 
being true not to influence the value of the regression coefficient. How- 
ever, if we estimate the percentage of disease by means of a small sample, 
there will be a definite error in determining X and the regression coefficient 
will be biased accordingly. 

In this connection Yates [7] makes the following statement. “Errors 
in the independent variates do not affect the validity of the ordinary tests 
of significance, provided that only differences from zero are being tested, 
and the coefficients as ordinarily evaluated will always be the best coeffi- 
cients for the purpose of prediction from subsequent observations subject 
to similar errors.” He points out further that errors in the independent 
variable introduce a bias which generally reduces the absolute value of the 
regression coefficient. Therefore a value of $b$ that is biased cannot
strictly be compared with a theoretical population value $\beta$ if the independent
variable is subject to error. This would naturally affect the theory of
setting up fiducial limits. The general problem of fitting a regression and making
tests of significance when the independent variable is subject to error is
fairly complex, so it is not feasible to describe it in detail here. For a
complete discussion and methods of fitting, refer to Deming [1].

7. Calculation of linear regressions for ungrouped data. The simplest
situation occurs when there is a series of pairs of values of X and Y. We
might have, for example,

$$(Y_1, X_1),\ (Y_2, X_2),\ \ldots,\ (Y_n, X_n)$$

and wish to obtain the regression equation $Y_e = a + bX$. Ordinarily we
would not group the data unless we have about 50 or more pairs. A
suitable method of calculation is to obtain directly from the paired values
$\sum X^2$, $\sum Y^2$, and $\sum XY$, together with the totals $\sum X$ and $\sum Y$. Then we calculate

$$\sum x^2 = \sum X^2 - \frac{(\sum X)^2}{n}$$

$$\sum y^2 = \sum Y^2 - \frac{(\sum Y)^2}{n}$$

$$\sum xy = \sum XY - \frac{(\sum X)(\sum Y)}{n}$$

and finally
$$b = \frac{\sum xy}{\sum x^2}$$

A test of significance can be made from the analysis of variance where
the corresponding sums of squares and degrees of freedom are

$$\sum y^2 = \frac{(\sum xy)^2}{\sum x^2} + \sum y_d^2 \qquad (17)$$

$$n - 1 = 1 + (n - 2) \qquad (18)$$

If we wish to set up fiducial limits for $b$ we determine $s_b$ as in formulas
(8) and (9), and proceed as in Section 4.
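
The machine method of this section, in which the raw sums are accumulated in one pass and then corrected for the means, might be sketched as follows. The five pairs are hypothetical and serve only to exercise the arithmetic; `corrected_sums` is our own name.

```python
def corrected_sums(pairs):
    """Return (sum_xx, sum_yy, sum_xy) in deviation form from raw (X, Y) pairs."""
    n = len(pairs)
    sx = sy = sxx = syy = sxy = 0.0
    for X, Y in pairs:            # one pass over the data, accumulating raw sums
        sx += X
        sy += Y
        sxx += X * X
        syy += Y * Y
        sxy += X * Y
    return (sxx - sx * sx / n,    # sum of x^2
            syy - sy * sy / n,    # sum of y^2
            sxy - sx * sy / n)    # sum of xy

# Hypothetical pairs, for illustration only.
pairs = [(2, 3), (4, 4), (5, 7), (7, 8), (9, 12)]
sum_xx, sum_yy, sum_xy = corrected_sums(pairs)

b = sum_xy / sum_xx
ss_regression = sum_xy ** 2 / sum_xx       # 1 DF, first term of equation (17)
ss_deviations = sum_yy - ss_regression     # n - 2 DF, second term of equation (17)
print(b, ss_regression, ss_deviations)
```

On a modern computer the one-pass form can lose accuracy when the means are large relative to the spread, which is one more reason for the coding described in Section 9.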


FIGURE 6-4. Graph of means of arrays from a regression table.


8. Linear regression for data arranged in a regression table. A regres- 
sion analysis can always be carried through by the method described in 
Section 7, but when there are more than 50 pairs it is more convenient to 
arrange the data in a table similar to Table 6-1. There is another
advantage in setting up a table of this kind as it enables us to make a test 
of the significance of deviations from linear regression. 

Representing the regression table diagrammatically as in Figure 6-4,
we can then show an enlarged section of one of the arrays of Y, as in
Figure 6-5, together with a single observation. Then $y$ can be broken
up into $y_a + y_d + y_e$, where $y_a$ is the deviation of an individual observation
from the mean of the array, $y_d$ is the deviation from regression of the
mean of an array, and $y_e$ is the deviation of the regression line from the
mean of Y. It may be proved easily* that the total sum of squares and
degrees of freedom can be expressed as

$$\sum_1^N y^2 = \sum_1^q n_p y_e^2 + \sum_1^q n_p y_d^2 + \sum_1^q \sum_1^{n_p} y_a^2 \qquad (19)$$
$$N - 1 = 1 + (q - 2) + (N - q) \qquad (20)$$
Total = Regression + Deviations + Error

where $n_p$ is the frequency in a Y array, $q$ is the number of arrays, and $N$
is the total number of pairs of observations. From these expressions it


* In one array, writing $y_a$ for the deviation of an observation from the mean of
its array and $y_p$ for the deviation of that array mean from the general mean,
$$\sum_1^{n_p} y^2 = \sum_1^{n_p} (y_a + y_p)^2 = \sum_1^{n_p} y_a^2 + n_p y_p^2 + 2y_p\sum_1^{n_p} y_a$$
and, since $y_p$ is constant for the array, and $\sum_1^{n_p} y_a = 0$, it follows that
$$\sum_1^{n_p} y^2 = \sum_1^{n_p} y_a^2 + n_p y_p^2$$
Over all $q$ arrays, where $\sum_1^q n_p = N$,
$$\sum_1^N y^2 = \sum_1^q \sum_1^{n_p} y_a^2 + \sum_1^q n_p y_p^2$$
which is a familiar expression of the analysis representing the sum of squares for total,
within, and between arrays. The corresponding degrees of freedom are
$$N - 1 = (N - q) + (q - 1)$$

It remains to show that the sum of squares for between arrays can be broken up into
two parts, representing linear regression and deviations from linear regression.
Over all arrays
$$\sum_1^q n_p y_p^2 = \sum_1^q n_p (y_d + y_e)^2 = \sum_1^q n_p y_d^2 + \sum_1^q n_p y_e^2 + 2\sum_1^q n_p y_d y_e$$
which is equivalent to equation (5), Section 3, and would reduce to that equation if $n_p$
were constant for all arrays. Therefore $\sum_1^q n_p (y_d y_e) = 0$, and the complete expression
for sums of squares with corresponding degrees of freedom is as given in (19) and (20).
can be seen that a test of significance can be made either for regression or
deviations from regression. 

It is recommended that in a regression analysis the tests of significance 
as outlined be performed first. If the mean square for linear regression 
is not significant, there is no object in fitting a straight line. If the mean 
square for deviations from regression is significant, the indications are 
that some function other than that of a straight line will give a better fit. 


FIGURE 6-5. Section of regression graph showing deviation of individual
variate from the mean of Y, broken down into three parts.


This is of course true even if the mean square for linear regression is also
significant.
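
The partition into regression, deviations from regression, and error can be verified on a small grouped sample. The sketch below is built on a hypothetical regression table of 10 pairs in 4 arrays; the data and variable names are our own, and each of the three parts is computed independently so that their agreement with the total is a genuine check.

```python
# Hypothetical regression table: X class value -> the Y values in that array.
arrays = {1: [2.0, 3.0, 4.0], 2: [4.0, 6.0], 3: [5.0, 7.0, 9.0], 4: [9.0, 10.0]}

N = sum(len(ys) for ys in arrays.values())     # total number of pairs
q = len(arrays)                                # number of arrays
mean_y = sum(sum(ys) for ys in arrays.values()) / N
mean_x = sum(x * len(ys) for x, ys in arrays.items()) / N

sum_xx = sum(len(ys) * (x - mean_x) ** 2 for x, ys in arrays.items())
sum_xy = sum((x - mean_x) * (y - mean_y) for x, ys in arrays.items() for y in ys)
sum_yy = sum((y - mean_y) ** 2 for ys in arrays.values() for y in ys)
b = sum_xy / sum_xx

ss_regression = ss_deviations = ss_error = 0.0
for x, ys in arrays.items():
    n_p = len(ys)
    array_mean = sum(ys) / n_p
    y_e = b * (x - mean_x)                  # regression line minus general mean
    y_d = (array_mean - mean_y) - y_e       # array mean minus regression line
    ss_regression += n_p * y_e ** 2
    ss_deviations += n_p * y_d ** 2
    ss_error += sum((y - array_mean) ** 2 for y in ys)

ms_deviations = ss_deviations / (q - 2)
ms_error = ss_error / (N - q)
print("SS:", ss_regression, ss_deviations, ss_error, "total", sum_yy)
print("DF:", 1, q - 2, N - q, "total", N - 1)
print("F for deviations from regression:", ms_deviations / ms_error)
```

The three sums of squares add to the total, and the degrees of freedom add to $N - 1$, whatever the frequencies in the individual arrays.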

9. Calculations for a regression table. The easiest way to learn how 
to do the calculations is to follow through Example 6-2. This example 
explains also how to proceed in setting up the table itself. 

In many examples the labor of calculation can be reduced by coding 
the data. This involves either subtracting a uniform quantity from the 
values of each individual variate or dividing by a constant quantity, or in 
certain cases both devices may be employed at the same time. Suppose 
that the actual values are as given below on the left; the values on the 
right are examples of how the coding may be carried out. 


Uncoded           Coded
X       Y         X      Y
2402    2785      240    278    Dividing by 10 and rounding off last figure
                  40     78     Subtracting 200
198     196       8      6      Subtracting 190
195     193       5      3
256     274       56     74     Subtracting 200, but in general avoid negative
229     198       29     -2     values if a machine is available for calculation
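
The effect of coding on the regression coefficient can be checked directly. In the sketch below the first two pairs are taken from the table above and the remaining values are hypothetical; `slope` is our own helper. Subtracting a constant leaves $b$ unchanged, while dividing X by 10 multiplies $b$ by 10, so the coefficient must be decoded before it is reported in the original units.

```python
def slope(xs, ys):
    """Regression coefficient of Y on X from sums of squares and products."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

# Uncoded values: the first two pairs are from the table, the rest are hypothetical.
X = [198, 195, 201, 204, 192]
Y = [196, 193, 199, 205, 191]

b_raw = slope(X, Y)
b_shift = slope([x - 190 for x in X], [y - 190 for y in Y])   # subtracting 190
b_scale = slope([x / 10 for x in X], Y)                        # dividing X by 10

print(b_raw, b_shift, b_scale / 10)
```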
10. Example 6-1. Regression analysis on a series of paired values. In
a hypothetical example the values for 10 pairs of variates are as given 
below. 


ag "Rc TENE 33. CL .22 XO eter SU 


Y. 0059 as Su 700 31 6 Le Yaa 1 535—752 TY 


To find the sums of squares of both sets of variates and the sums of 
products, we proceed as follows. 


$\sum XY = (9 \times 7) + \cdots = 321$
$(\sum X)(\sum Y)/N = (50 \times 52)/10 = 260$
Difference $= \sum xy = 61$

$\sum Y^2 = 346$
$(\sum Y)^2/N = 2704/10 = 270.4$
Difference $= \sum y^2 = 75.6$

$\sum X^2 = 9^2 + \cdots = 318$
$(\sum X)^2/N = (50)^2/10 = 250$
Difference $= \sum x^2 = 68$

Then the sum of squares for linear regression is

$$\frac{(\sum xy)^2}{\sum x^2} = \frac{(61)^2}{68} = 54.72$$

We can then set up the analysis of variance.

                                       SS       DF     MS       F
Regression                             54.72    1      54.72    20.97
Error or deviations from regression    20.88    8      2.610
Total                                  75.60    9

Since the regression is very significant, we proceed to obtain the regression
coefficient given by
$$b_{yx} = \frac{\sum xy}{\sum x^2} = \frac{61}{68} = 0.8971$$
and the regression equation is
$$Y_e = (\bar{Y} - b\bar{X}) + bX$$
$$= (5.2 - 0.8971 \times 5.0) + 0.8971X$$
$$= 0.7145 + 0.8971X$$

The fiducial limits of $b_{yx}$ at the 5% point are given by
$$l_1 = b - t_{.05} s_b \qquad l_2 = b + t_{.05} s_b$$

where $s_b = \sqrt{20.88/(8 \times 68)} = 0.1959$ and $t_{.05}$ is 2.31, the value of $t$ at
the 5% point for $n_1 = 1$ and $n_2 = 8$. Here we have

$$l_1 = 0.8971 - 2.31 \times 0.1959 = 0.4446$$
$$l_2 = 0.8971 + 2.31 \times 0.1959 = 1.3496$$

The standard error of estimates of yield for different levels of fertilizer
can be determined from equation (15). In general

$$s_{Y_e}^2 = 2.610\left(\frac{1}{10} + \frac{x^2}{68}\right) = 0.2610 + 0.0384x^2$$
Then for $x = 0$, $s_{Y_e} = \sqrt{0.2610} = 0.511$
         $x = 2$, $s_{Y_e} = \sqrt{0.4146} = 0.644$
         $x = 4$, $s_{Y_e} = \sqrt{0.8754} = 0.936$


From these calculations it is evident that there is a much greater possibility 
of error in estimating values of Y near the ends of the regression line. 
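
The whole of Example 6-1 can be rerun from the printed totals alone. The sketch below takes $N$, $\sum X$, $\sum Y$, $\sum XY$, $\sum X^2$, $\sum Y^2$ and the tabulated $t$ of 2.31 from the text; the variable names are our own. Because the program carries full precision, the results agree with the printed figures to within rounding in the last place.

```python
import math

# Totals printed in Example 6-1.
n = 10
sum_X, sum_Y = 50.0, 52.0
sum_XY, sum_X2, sum_Y2 = 321.0, 318.0, 346.0
t_05 = 2.31                                  # tabulated t for 8 DF, as quoted in the text

sum_xy = sum_XY - sum_X * sum_Y / n          # 61
sum_xx = sum_X2 - sum_X ** 2 / n             # 68
sum_yy = sum_Y2 - sum_Y ** 2 / n             # 75.6

ss_reg = sum_xy ** 2 / sum_xx                # regression, 1 DF
ms_err = (sum_yy - ss_reg) / (n - 2)         # deviations from regression, 8 DF
b = sum_xy / sum_xx
a = sum_Y / n - b * sum_X / n
s_b = math.sqrt(ms_err / sum_xx)             # equation (9)

print(f"F = {ss_reg / ms_err:.2f}")
print(f"Y_e = {a:.4f} + {b:.4f}X")
print(f"limits: {b - t_05 * s_b:.4f} to {b + t_05 * s_b:.4f}")
for x in (0, 2, 4):
    se = math.sqrt(ms_err * (1 / n + x * x / sum_xx))   # equation (15)
    print(f"x = {x}: s_Ye = {se:.3f}")
```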

11. Example 6-2. Regression analysis for a large series of paired values 
set up in a regression table. When dealing with a large number of
variates, it is convenient to make up a frequency table in order to summarize
the data and reduce the labor of calculating the mean and standard
deviation. Similarly, in regression studies, for a large series of paired 
values, it is desirable to make up a table that combines the frequency 
tables of the two variates. In the next chapter we shall see also that this 
table is of value for calculating the correlation coefficient. 

To prepare the table it is convenient to copy the paired values on cards 
of a size that can be handled easily. Thus if we decided to make up a 




table for the carotene data of Table 6-4, the cards for the first three pairs 
would be as follows. 


First Card Second Card Third Card 
X 1.18 X 2.13 X 1.41 
Y 2.39 Y 3.11 Y 2.15 


After deciding on the class values as outlined in Chapter 2, Section 9, 
distribute the cards for one of the variables, preferably Y, and then 
distribute each pile for the second variable. As each pile is distributed 
the frequencies can be entered in the appropriate cell of the table. 

Table 6-1 was prepared in the above manner for readings on the 
transparency and area of 400 red blood cells from a normal patient 
(Savage et al. [4]). The transparency values are in arbitrary units based
on deflections of a galvanometer operated by a photoelectric cell. Notice 
that in the margins the actual class values have been replaced by a series 
of natural numbers beginning with 1. This is a form of coding which 
reduces the labor of calculation. 

After setting up the regression table it is convenient to set up a table 
similar to Table 6-2 for entering values to be used later. For the column 
headed “Totals, Y Arrays" proceed to obtain the totals for each array as 
follows, where the first array of Y is the distribution of the Y variates 
that fall in the first class for X. 


First array     (1 × 7) + (2 × 4) = 15
Second array    (2 × 8) + (2 × 7) + (3 × 6) + (1 × 3) = 51


etc. 


The total for this column is obviously $\sum Y$. In the same way obtain the
totals for the X arrays and $\sum X$. There are two columns headed $\sum XY$,
the object being to calculate $\sum XY$ in two ways in order to have a complete
check on the calculations. The entries in the first $\sum XY$ column are
obtained by multiplying the totals for the Y arrays by the corresponding
class values of X. In the second $\sum XY$ column multiply the totals for the
X arrays by the corresponding class values of Y. We have to calculate
also $\sum Y^2$ and $\sum X^2$, the method being the same as outlined in Chapter 2
for any frequency distribution. If a calculating machine is available, it
is most convenient to summate these as we multiply frequencies by squares
of corresponding class values. $\sum XY$ may also be obtained by a process
of continuous multiplication and summation on the machine, and in that
case the individual entries as in Table 6-2 can be omitted.
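
The double calculation of $\sum XY$ described above can be illustrated on a small table. The 3 × 3 frequency table in this sketch is hypothetical (it is not Table 6-1), and the dictionary layout `freq[y_class][x_class]` is simply one convenient assumption.

```python
# Hypothetical regression table: freq[y_class][x_class] = number of pairs in the cell.
freq = {
    3: {1: 0, 2: 1, 3: 2},
    2: {1: 1, 2: 3, 3: 1},
    1: {1: 2, 2: 1, 3: 0},
}
x_classes = [1, 2, 3]
y_classes = [1, 2, 3]

# Totals of the Y arrays: for each X class, the sum of frequency times Y class value.
y_array_totals = {x: sum(freq[y][x] * y for y in y_classes) for x in x_classes}
# Totals of the X arrays: for each Y class, the sum of frequency times X class value.
x_array_totals = {y: sum(freq[y][x] * x for x in x_classes) for y in y_classes}

sum_Y = sum(y_array_totals.values())
sum_X = sum(x_array_totals.values())

# Sum of XY computed in two ways, as a check on the arithmetic.
sum_XY_first = sum(x * total for x, total in y_array_totals.items())
sum_XY_second = sum(y * total for y, total in x_array_totals.items())

print(sum_X, sum_Y, sum_XY_first, sum_XY_second)   # prints: 22 22 48 48
```

Any disagreement between the two values of $\sum XY$ points to an error in one of the array totals.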
In this example we have 


$\sum XY$ = 18,345.0000            $\sum Y^2$ = 27,509.0000        $\sum X^2$ = 13,337.0000

$(\sum X)(\sum Y)/N$ = 17,679.5325   $(\sum Y)^2/N$ = 25,808.4225    $(\sum X)^2/N$ = 12,111.0025

$\sum xy$ = 665.4675               $\sum y^2$ = 1,700.5775         $\sum x^2$ = 1,225.9975


At this stage we may omit the calculations leading to the test of the
significance of deviations from regression, in which case the regression
analysis can be set up directly after calculating the sum of squares for
regression given by


$$\text{SS (regression)} = \frac{(\sum xy)^2}{\sum x^2} = \frac{(665.4675)^2}{1225.9975} = 361.22$$


Analysis of Variance 


SS DF MS
Regression 361.22 1 361.22 
Error 1339.36 398 3.3652 
Total 1700.58 399 


The $F$ value is very large, so its calculation can be omitted.
The regression coefficient is given by

$$b_{yx} = \frac{\sum xy}{\sum x^2} = \frac{665.4675}{1225.9975} = 0.5428$$
and the regression equation is
$$Y_e = (\bar{Y} - b_{yx}\bar{X}) + b_{yx}X$$
$$= (8.0325 - 0.5428 \times 5.5025) + 0.5428X$$
$$= 5.0457 + 0.5428X$$

If a graph is required, calculate two values of $Y_e$, preferably for $X = 1$
and $X = 12$, and locate the regression line accordingly. The standard
error of $b$ is given by

$$s_b = \sqrt{\frac{3.3652}{1225.9975}} = \sqrt{0.002745} = 0.0524$$
The fiducial limits of $b_{yx}$ at the 5% point are
$$l_1 = 0.5428 - 1.97 \times 0.0524 = 0.4396$$
$$l_2 = 0.5428 + 1.97 \times 0.0524 = 0.6460$$


In order to make a test of deviations from regression we require the 
total of the last column of Table 6-2. Each entry in this column is the 
square of the total for a Y array, divided by the corresponding frequency. 
Thus 


$$\frac{15^2}{3} = 75.00, \qquad \frac{51^2}{8} = 325.12, \quad \text{etc.}$$
This gives the sum of squares for between arrays as follows. See
equation (19).
Sum of (array total)$^2/n_p$ = 26,181.23
$(\sum Y)^2/N$ = 25,808.42
Difference (between arrays) = 372.81

The complete analysis can then be set up. 


SS DF MS 
Regression 361.22 1 361.2 
Deviations from regression 11.59 10 1.159 
Error 1327.77 387 3.431 
Total 1700.58 


The deviations from regression are less than expectation, so there is no 
object in calculating a value of F. 

The depression in deviations from regression can in fact be shown to 
be statistically significant. The ratio 3.431/1.159 = 2.96 exceeds the 5% 
point for 387 and 10 degrees of freedom. Such a result should lead the 
experimenter to a closer examination of the data in order to find the 
reason for the unusually close fit of the means of the arrays to the straight 
line. In this example, since we have only the bare figures and no details 
of the experiment, we can only conclude that the close fit may be merely 
a random occurrence. 

It should be pointed out here that, had the deviations from regression 
proved significant, the method of setting up of the fiducial limits for b 
would not have been quite appropriate since the sum of squares used for 


TABLE 6-1 


DATA FOR REGRESSION OF TRANSPARENCY (Y) ON AREA (X) FOR 
400 HUMAN BLOOD CELLS 


Frequency X 3 8 36 65 96 90 53 31 10 4 3 1 400=n 
| 
14 | 0.691-0.720 1 1 
13 | 0.661-0.690 0 
12 | 0.631-0.660 DAE у €T 1 8 
11 | 0.601-0.630 ССС e 502) 53 29 
10 | 0.571-0.600 45. I3. 716 lal 3 1 | 65 
9 | 0.541-0.570 4209244 201 1556. 3 71 
8 | 0.511-0.540 274513620. 9221903 3» 72 86 
7|0.48-0510| 1 2 6 10 17 10 4 4 54 
6 | 0.451-0.480 АОК ЕТЕ 287 24 36 
5 | 0.421-0.450 A127 52 26 
4 | 0.391-0.420 | 2 dy Te QUSS 15 
3 | 0.361-0.390 Е Е СЕ I 7 
2 | 0,331-0.360 0 
1 | 0,301-0.330 D t 2 
Actual àags5 Ş 22 82 G|Frequenc 
к И Ул Ды О d Eie cad 
a е m mM na + ч ч e Ба) n 
Coded classes 19342 937 ELS УОТУ 8.29.1 IOI AF 
TABLE 6-2 


CALCULATIONS FOR REGRESSION ANALYSIS, DATA OF TABLE 6-1 


Totals, Totals, 55 


Бк p PEARY Uy Cry Y ^ ЖҮ) BEEN 
Arrays Arrays Arrays 
1 2 1 3 15 АК 7 759 
2 0 2 8 sj 210227 70 0 32512 
3 7 222% 239 74 28 84 1,573.44 
4 15 о) 45 10860 Өө 20 31326154 
5 26 Fon 196 750 3,50 10 550 5,859.38 
6 36 ӨМ 90 750 4554 169 1,014 640090 
7 54 25123 464 — 3248 268 1,876 4,062.19 
8 86 ТІРІ 285 220 465 3,720 262016 
9 71 9470 95 85 0 3780 90250 
10 см іо 4 46 40 403 200 52900 
11 э и 3 33 38 204 224 36300 
12 8 12 1 1 14 54 68 140 
13 0 0 
14 І 10 шо 
400 400 3013 18,45 2201 18,345 26,8123 


n n Ey tL DAVY С PDO ҚҮР 
the error term included deviations from regression. With the present
analysis the procedure of obtaining fiducial limits for $b$ would be to use
the error mean square 3.431. Thus

$$s_b = \sqrt{\frac{3.431}{1225.9975}} = 0.0529$$

Then

$$l_1 = 0.5428 - 1.97 \times 0.0529 = 0.4386$$
$$l_2 = 0.5428 + 1.97 \times 0.0529 = 0.6470$$


Since the deviations from regression are not significant, the fiducial limits 
agree closely with those obtained previously. 


12. Exercises. 


1. Table 6-3 gives the results obtained in an experiment with 25 wheat varieties on
the number of days from seeding to heading and the number of days from seeding to 
maturity. Calculate the regression equation for the regression of days to mature on 
days to head, and test the significance of the regression coefficient. Code the data 
before beginning your calculations by subtracting 50 from the days to head and 85 
from the days to mature. Find the fiducial limits at the 5% point of the regression 
coefficient, and decide as to the practicability of substituting days to head for days to 
mature, on the basis of the data provided by this sample. 

Regression coefficient = 105.23/125.68 = 0.8373 (coded data). 


TABLE 6-3 


Data on Days To HEAD AND Days To MATURE OF 25 WHEAT VARIETIES 


Days to Days to Days to Days to 
Variety Head Mature Variety Head Mature 
1 60.0 94.4 14 58.2 92.4
2 53.6 89.0 15 58.0 91.6 
3 59.0 94.0 16 59.4 94.0 
4 61.8 95.4 17 55.4 90.8 
5 53.8 88.2 18 61.6 95.2 
6 57.8 93.4 19 6:05 912 
7 57.8 93.6 20 60.2 94.6 
8 58.4 92.0 21 61.6 96.0 
9 57.8 92.8 22 57.6 92.6 
10 59.0 93.4 23 60.8 95.4 
11 59.2 93.8 24 61.2 94.4
12 59.0 92.8 25 58.2 94.0 
13 58.6 94.2 


TABLE 6-4


CAROTENE CONTENT OF FLOUR AND WHOLE WHEAT FOR 139 VARIETIES


Variety  Carotene  Carotene    Variety  Carotene  Carotene    Variety  Carotene  Carotene
         in Flour  in Wheat             in Flour  in Wheat             in Flour  in Wheat

  1      2.39      1.18          48     1.71      1.16          95     1.97      1.33
  2      3.11      2.13          49     1.93      1.14          96     1.83      1.14
  3      2.15      1.41          50     1.81      1.30          97     2.00      1.51
  4      1.96      1.42          51     1.89      1.32          98     1.96      1.28
  5      2.02      1.50          52     1.65      1.32          99     2.00      1.33

  6      1.76      1.25          53     1.93      1.28         100     2.02      1.32
  7      2.10      1.65          54     2.12      1.48         101     1.78      1.17
  8      2.12      1.24          55     2.25      1.50         102     1.83      1.10
  9      2.28      1.48          56     1.92      1.42         103     1.93      1.22
 10      1.86      1.35          57     2.25      1.66         104     2.14      1.44

 11      2.60      1.58          58     2.25      1.63         105     2.15      1.54
 12      2.11      1.45          59     1.65      1.18         106     2.3       1.46
 13      2.30      1.74          60     1.63      1.14         107     1.97      1.40
 14      1.80      1.42          61     1.70      1.22         108     1.83      1.11
 15      2.00      1.45          62     1.61      1.20         109     2.10      1.40

 16      2.05      1.87          63     1.83      1.33         110     1.84      1.19
 17      2.09      2.00          64     1.60      1.13         111     1.98      1.39
 18      2.3       1.65          65     1.37      0.92         112     2.31      1.60
 19      2.29      1.64          66     1.96      1.20         113     2.29      1.53
 20      2.30      1.62          67     1.88      1.26         114     2.15      1.45

 21      1.97      1.55          68     1.92      1.34         115     1.96      1.44
 22      2.36      1.68          69     1.89      1.04         116     1.98      1.40
 23      1.73      1.32          70     1.99      1.26         117     1.89      1.30
 24      1.72      1.47          71     1.82      0.98         118     2.08      1.33
 25      1.70      1.53          72     2.12      1.31         119     2.00      1.42

 26      1.63      1.50          73     2.16      1.16         120     2.06      1.44
 27      1.93      1.48          74     2.14      1.04         121     1.96      1.36
 28      1.50      1.25          75     1.63      0.88         122     2.07      1.38
 29      1.77      1.33          76     2.76      1.91         123     2.24      1.51
 30      1.60      1.40          77     2.07      1.20         124     2.15      1.38

 31      2.31      1.49          78     1.67      1.07         125     1.83      1.18
 32      CD        1.42          79     2.78      1.80         126     1.84      1.20
 33      2.10      1.35          80     3.40      2.02         127     2.03      1.45
 34      2.90      1.58          81     3.67      2.10         128     1.87      1.05
 35      2.17      1.50          82     2.41      1.61         129     2.24      1.44

 36      2.15      1.40          83     2.23      1.38         130     2.14      1.06
 37      2.01      1.40          84     3.07      1.93         131     2.13      1.10
 38      2.35      1.67          85     2.22      1.44         132     2.03      0.98
 39      2.34      1.62          86     2.55      1.58         133     2.25      1.31
 40      2.00      1.47          87     2.12      1.39         134     2.33      1.08

 41      2.18      1.55          88     1.94      1.27         135     2.01      1.14
 42      2.47      1.73          89     1.95      1.41         136     1.89      1.41
 43      2.25      1.62          90     1.59      1.08         137     3.00      2.20
 44      1.77      1.39          91     2.00      1.30         138     2.16      1.73
 45      1.68      1.34          92     1.77      1.22         139     2.29      1.61

 46      2.46      1.29          93     1.98      1.26
 47      1.86      1.28          94     1.97      1.30




2. Table 6-4 contains data on the carotene content determined by two methods for 
139 wheat varieties. By one method carotene was determined on the whole wheat, 
and by the other method, on the flour. The figures for carotene in the wheat are lower 
than for carotene in the flour, which is of course the reverse of the actual condition. 
This was due to a different method of extraction for the whole wheat which gave lower 
but relative results. 

Make out cards, one for each pair of values, and prepare a regression table, letting 
the flour carotene represent the dependent variable Y. In order to reduce the labor 
of calculation make the classes fairly large; for example, let the first class for X be 
0.85 to 0.95, and the first class for Y be 1.33 to 1.49. Also do not forget to replace
the actual class values by the natural numbers, beginning at 1, before going ahead with 
the calculations. Determine the regression equation, and prepare a regression graph. 

$b_{yx} = 438.39/665.96 = 0.6583$ (coded data).


3. Prove (a) $\sum xy = \sum XY - \dfrac{(\sum X)(\sum Y)}{N}$

(b) $\sum (Y - Y_e)^2 = \sum y^2 - b_{yx}^2 \sum x^2 = \sum y^2 - b_{yx}\sum xy$

CHAPTER 7


Correlation 


1. Covariation. This term is very expressive of the fundamental 
situation regarding two variables, from which the methods of correlation
arise. In the previous chapter it was pointed out that, when two variables 
are so related that one may logically be considered as dependent on the 
other, the methods of regression are completely applicable to an analysis 
of this relation; but, when the two variables cannot be considered as 
dependent and independent, regression methods are not fully satisfactory. 
Suppose that a study is to be made of the relation between the heights of 
brothers and sisters. It would not be logical to consider the height of 
one member of the pair as being dependent on the height of the other, 
yet we may be fairly certain that the two variables are in some way 
related, and wish to measure the relation and be able to compare it in 
some logical manner with other similar relations. The question most 
frequently asked with respect to two such variables is: To what extent 
do the heights of brother and sister vary together? Thus we have the 
term covariation, and the conventional statistic for the measurement of 
covariation is the correlation coefficient. 

2. Definition of correlation. Table 7-1 presents three sets of figures 
that may be taken as measurements on 2 variables which we shall designate 


TABLE 7-1 


THREE SAMPLES OF PAIRED VARIATES ILLUSTRATING THE PHENOMENON 
OF CORRELATION 


Setl X, "ACTU. СЗ 8509-3 Ма 50, 
X, 5:20. 762 1 «3- 129" 4. 6. 8) "Total 52 
542 Х 97 8 qq 086. 5 25 3 1271 Total 50 
X, 9070, 58-6116 250 4723-1717 оар 52, 
Set 3 X p- 3 35 5746: Яс DR SO E 
X, 9.2018. 60,6 Ж 427322101 a 


as $X_1$ and $X_2$. On examining these, it will be noted that the relation
between $X_1$ and $X_2$ is different in each. In set 2 high values of $X_1$ are
associated with high values of $X_2$, and in set 3 high values of $X_1$ are associated
with low values of $X_2$. In both sets there is an obvious relation, but
one is the reverse of the other. In set 1, on the other hand, the 2 variables
do not appear to be related. These sets may be regarded as samples from
infinite parent populations of paired variables. In the population from
which set 2 is drawn, whenever a pair of variates is selected at random,


FIGURE 7-1. Scatter diagrams for data of Table 7-1.


we expect to find, if the pair contains a high value of $X_1$, that a high value
of $X_2$ will be associated with it. In the population represented by the
sample in set 3, it is to be expected that high values of $X_1$ will be found
associated with low values of $X_2$. These two opposite situations are
referred to as positive and negative correlation. Set 1 represents still
another situation. High values of $X_1$ do not appear to be associated
with either high or low values of $X_2$. In other words, it is to be expected
that in the parent populations the 2 variables vary independently. A
graphical picture of the results with these 3 samples is given in Figure 7-1.
Each sample is represented by what is usually known as a dot or scatter
diagram. The values of $X_1$ are represented as ordinates and the values
of $X_2$ as abscissas, so that each pair can be represented by a dot on the
diagram. The final result is a figure that represents in a general way, by
the scatter of the dots, the relation between the 2 variables. For set 1
the dots are scattered more or less uniformly over the whole surface.
For sets 2 and 3 there is a definite relation between the variables, as shown
by the tendency for the dots to arrange themselves in a straight line along
the diagonals of the square. We are reminded here of the regression
graphs of the previous chapter. The difference is that we are now studying
not the effect of one variable on the other, but rather the degree to which
the variables vary together owing presumably to influences that are
common to both. If such measurements represent heights of brothers
and sisters, it is apparent that this common influence might be the similarity
of their genes.

3. Measurement of correlation. Figure 7-2 illustrates the shape of the 
swarm in a correlation surface for three different degrees of correlation. 


FIGURE 7-2. Correlation surfaces (A, B, C) showing variation in shape of swarm
with increasing correlation.


The circular swarm at A represents zero correlation; in C the swarm falls 
entirely on the diagonal and must represent perfect correlation; B shows 
a condition between the other two extremes. Now each surface is divided 
into quadrants by lines erected at the positions of the means, and in each 
quadrant are plus and minus signs representing the signs of the products 
of the $X_1$ and $X_2$ deviations from their means. Thus in the upper right- 
hand quadrant (1) the deviations of $X_1$ and $X_2$ are both positive so that 
the product of the deviations is positive. Therefore products are positive 
in quadrants (1) and (3) and negative in (2) and (4). Now it is obvious 
in A that the plus and minus products will cancel each other and the sum 
will be zero. In C all the products will be positive so that their sum will 
be a maximum. In B the condition is intermediate between A and C. 
The plus products are greater than the negative products; hence we have 
a positive but not a perfect correlation. 

Let us consider now the sets of figures in Table 7-1. On calculating 
the sum of products $\sum x_1x_2$ for each set, we find agreement with the theory 
outlined above. To carry out these calculations we make use of the 
identity 

$$\sum x_1x_2 = \sum X_1X_2 - \frac{(\sum X_1)(\sum X_2)}{n}$$

where $\sum X_1$ is the sum of the $X_1$ values, $\sum X_2$ is the sum of the $X_2$ values, 
and $n$ is the number of pairs. The calculations come out as follows: 

          $\sum X_1X_2$    $(\sum X_1)(\sum X_2)/n$    $\sum x_1x_2$ 
Set 1         262                 260                    2 
Set 2         335                 260                   75 
Set 3         186                 260                  -74 

The result is in perfect agreement with the theory that the sum of products 
is a measure of correlation. 

The sum of products is an absolute measure of correlation, but it will 
not serve as a relative measure because it is dependent on factors that are 
characteristic of the data for the individual variables and have nothing 
to do with the correlation between them. It depends on the number of 
pairs of variates measured, on the units of measurement, and on the 
variability of both variables. The first objection is overcome by dividing 
by the degrees of freedom, giving $\sum x_1x_2/(n-1)$, which we now define as 
the covariance. It can be represented by the symbol $w_{12}$. To overcome 
the other difficulties we express the variates in what is known as standard 
measure. If a variate is represented by $X$, in standard measure it is 

$$\frac{X - \bar{X}}{s} = \frac{x}{s} \tag{1}$$

where $s$ is the standard deviation of the sample. Thus we have finally 

$$r_{12} = \frac{\sum (x_1/s_1)(x_2/s_2)}{n-1} = \frac{\sum (x_1x_2)/(n-1)}{s_1s_2} = \frac{w_{12}}{s_1s_2} \tag{2}$$


Another formula, giving the variances of $X_1$ and $X_2$ in place of their 
standard deviations, is 

$$r_{12} = \frac{\sum (x_1x_2)/(n-1)}{\sqrt{v_1v_2}} \tag{3}$$

where $v_1$ is the variance of $X_1$ and $v_2$ is the variance of $X_2$. Since the 
regression coefficient 

$$b_{12} = \frac{\sum x_1x_2}{\sum x_2^2} = \frac{\sum x_1x_2/(n-1)}{\sum x_2^2/(n-1)} = \frac{w_{12}}{v_2}$$

and 

$$b_{21} = \frac{w_{12}}{v_1}$$

formula (3) shows also that 

$$r_{12}^2 = \frac{w_{12}}{v_1} \times \frac{w_{12}}{v_2} = b_{12}b_{21}$$

and 

$$r_{12} = \sqrt{b_{12}b_{21}} \tag{4}$$

The correlation coefficient is therefore the geometric mean of the two 
regression coefficients. 
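
The relations among the covariance, the variances, the two regression coefficients, and the correlation coefficient can be checked numerically. The sketch below is only an illustration: the five pairs of values are hypothetical (they are not taken from Table 7-1), and the variable names are chosen here for convenience.

```python
import math

# Hypothetical paired observations (not from the text).
X1 = [1.0, 2.0, 3.0, 4.0, 5.0]
X2 = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(X1)

m1, m2 = sum(X1) / n, sum(X2) / n
x1 = [a - m1 for a in X1]                  # deviations from the mean of X1
x2 = [b - m2 for b in X2]                  # deviations from the mean of X2

w12 = sum(a * b for a, b in zip(x1, x2)) / (n - 1)   # covariance
v1 = sum(a * a for a in x1) / (n - 1)                 # variance of X1
v2 = sum(b * b for b in x2) / (n - 1)                 # variance of X2

r12 = w12 / math.sqrt(v1 * v2)             # correlation from covariance and variances
b12 = w12 / v2                             # regression of X1 on X2
b21 = w12 / v1                             # regression of X2 on X1

# Geometric mean of the two regression coefficients; the square root
# loses the sign, so the sign is taken from the covariance.
r_from_b = math.copysign(math.sqrt(b12 * b21), w12)

print(round(r12, 4), round(r_from_b, 4))
```

Both routes give the same coefficient, which is the point of the geometric-mean statement.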
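4. Range and interpretation of the correlation coefficient. If we take 

$$d_1 = \frac{x_1}{s_1} \quad \text{and} \quad d_2 = \frac{x_2}{s_2}$$

the correlation coefficient can be written neatly as 

$$r_{12} = \frac{\sum (d_1d_2)}{n-1} \tag{5}$$

and we can then prove very easily that* 

$$r_{12} = 1 - \frac{\sum (d_1 - d_2)^2}{2(n-1)} \tag{6}$$

and also that 

$$r_{12} = -1 + \frac{\sum (d_1 + d_2)^2}{2(n-1)} \tag{7}$$

Equation (6) shows us that, since $\sum (d_1 - d_2)^2$ cannot be less than 0, $r_{12}$ 
will have a maximum value of $+1$. Equation (7) shows that $r_{12}$ will have 
a minimum value of $-1$. Therefore, in general 

$$1 \geq r_{12} \geq -1 \tag{8}$$

which defines the complete range of values that the correlation coefficient 
can take. 

* Expanding the term on the right in (6), we have 

$$1 - \frac{\sum (d_1 - d_2)^2}{2(n-1)} = 1 - \frac{\sum d_1^2}{2(n-1)} - \frac{\sum d_2^2}{2(n-1)} + \frac{\sum d_1d_2}{n-1}$$

Now 

$$\frac{\sum d_1^2}{2(n-1)} = \frac{\sum (x_1/s_1)^2}{2(n-1)} = \frac{\sum x_1^2}{2s_1^2(n-1)} = \frac{1}{2}$$

because $s_1^2 = \sum x_1^2/(n-1)$. Similarly 

$$\frac{\sum d_2^2}{2(n-1)} = \frac{1}{2}$$

Then 

$$1 - \frac{\sum (d_1 - d_2)^2}{2(n-1)} = 1 - \left(\frac{1}{2} + \frac{1}{2}\right) + \frac{\sum (d_1d_2)}{n-1} = \frac{\sum (d_1d_2)}{n-1} = r_{12}$$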

The relation between correlation and linear regression can be extended 
from that given in Section 3 by reverting to the equation for the sums of 
squares of the dependent variable as given in Chapter 6, Section 3. The 
fundamental equation, considering $X_1$ as the dependent variable, is 

$$\sum x_1^2 = \sum x_d^2 + \sum x_e^2 \tag{9}$$

where $x_d$ represents a deviation of a value of $X_1$ from the regression line 
defined by $x_e = b_{12}x_2$ and $x_e$ represents the value of $x_1$ estimated by 
the regression equation. The regression coefficient $b_{12}$ is of course given 
by $\sum x_1x_2/\sum x_2^2$. 

Since $x_e = b_{12}x_2$ it follows that 

$$\sum x_e^2 = b_{12}^2 \sum x_2^2$$

Substituting $\sum x_1x_2/\sum x_2^2$ for $b_{12}$, we have 

$$\sum x_e^2 = \frac{(\sum x_1x_2)^2}{\sum x_2^2}$$

and, since $r_{12}^2 = (\sum x_1x_2)^2/(\sum x_1^2 \sum x_2^2)$ by definition, 

$$\sum x_e^2 = r_{12}^2 \sum x_1^2$$

The equation for sums of squares can then be written 

$$\sum x_1^2 = \sum x_d^2 + r_{12}^2 \sum x_1^2 \tag{10}$$

$$= (1 - r_{12}^2)\sum x_1^2 + r_{12}^2 \sum x_1^2 \tag{11}$$

for which the corresponding degrees of freedom are 

$$(n - 1) = (n - 2) + 1$$

This leads to an analysis of variance giving a test of significance of $r_{12}$. 
Obviously 

$$F = \frac{r_{12}^2 (n - 2)}{1 - r_{12}^2} \tag{12}$$

and this is exactly the same value as is obtained if we calculate $F$ in order 
to test the regression coefficient. This analysis shows that essentially the 
correlation coefficient is a measure of linear regression. 
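
The partition of the sum of squares and the equality of the two $F$ values can be verified on a small case. The following is a minimal sketch with hypothetical data (not from the text); $X_1$ is treated as the dependent variable, and the names `ss_reg` and `ss_dev` are labels chosen here for the regression and deviation sums of squares.

```python
# Hypothetical paired observations (not from the text); X1 is the dependent variable.
X1 = [1.0, 2.0, 3.0, 4.0, 5.0]
X2 = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(X1)

m1, m2 = sum(X1) / n, sum(X2) / n
x1 = [a - m1 for a in X1]
x2 = [b - m2 for b in X2]

ss1 = sum(a * a for a in x1)                  # total sum of squares of x1
ss2 = sum(b * b for b in x2)
sp = sum(a * b for a, b in zip(x1, x2))       # sum of products

b12 = sp / ss2                                # regression of X1 on X2
r2 = sp * sp / (ss1 * ss2)                    # squared correlation coefficient

ss_reg = sum((b12 * b) ** 2 for b in x2)                   # regression sum of squares
ss_dev = sum((a - b12 * b) ** 2 for a, b in zip(x1, x2))   # deviations from regression

F_anova = ss_reg / (ss_dev / (n - 2))         # F from the analysis of variance
F_from_r = r2 * (n - 2) / (1 - r2)            # F from equation (12)

print(ss1, ss_reg + ss_dev, F_anova, F_from_r)
```

The regression part equals $r_{12}^2$ times the total and the deviation part equals $(1 - r_{12}^2)$ times the total, so the two $F$ values agree apart from floating-point rounding.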

The relation of the correlation coefficient to linearity of regression can 
be illustrated further by dealing with equation (9) in a somewhat different 
manner. If we substitute $r_{12}^2 \sum x_1^2$ for $\sum x_e^2$, we have 

$$r_{12}^2 \sum x_1^2 = \sum x_1^2 - \sum x_d^2 \tag{13}$$

and 

$$r_{12}^2 = 1 - \frac{\sum x_d^2}{\sum x_1^2} \tag{14}$$

Similarly, by substituting $(1 - r_{12}^2)\sum x_1^2$ for $\sum x_d^2$, we can get 

$$r_{12}^2 = \frac{\sum x_e^2}{\sum x_1^2} \tag{15}$$

We note first from (14) that $r_{12}^2$ will approach a maximum value of 1 as 
the sum of squares for deviations from regression approaches zero. 
Equation (15) shows that $r_{12}^2$ is merely a ratio of the sum of squares for 
regression to the total. The latter is a very useful concept since it can be 
taken as the beginning of a generalized concept of correlation representing 
any type of regression. Thus, if a more complex equation representing 
a curve is fitted, we would have a series of estimated values of $X_1$ lying on 
a curve. Extending our concept of $r$ for linear regressions, we would be 
able to put 

$$r^2 = \frac{\sum {x'_e}^2}{\sum x_1^2}$$

where $\sum {x'_e}^2$ would now represent the sum of squares for the regression 
of the curved line. Such coefficients are usually called indices of curvilinear 
correlation. 

Equation (11) is of particular importance in interpreting a value of a 
correlation coefficient. The sum of squares for the dependent variable 
is split up into two portions, one part representing deviations from 
regression which is proportional to $1 - r_{12}^2$, and one part representing 
regression which is proportional to $r_{12}^2$. An approximate but valuable 
concept is to assume that $r_{12}^2$ measures the importance of a correlation. 
For example, we may have a coefficient $r_{12} = 0.90$ measuring the relation 
between yield and soil moisture, and by a different method of treating 
the plots we are able to raise this to 0.95. The increase in $r_{12}^2$ is 
$0.95^2 - 0.90^2 = 0.0925$. If the initial coefficient had been 0.40 and it is 
increased to 0.50, the difference is still 0.09; therefore at this level an 
increase of 0.1 is of no more importance than an increase of 0.05 at the 
higher level. 

5. Sampling distribution of the correlation coefficient. Suppose that we 
have a population in which the correlation coefficient $\rho_{12} = 0$ (the symbol 
$\rho_{12}$ replaces $r_{12}$ because it is the population parameter of which $r_{12}$ is an 
estimate), and we take a large series of samples and determine $r_{12}$ for 
each. If the values of $r_{12}$ are set up in a frequency distribution, we shall 
find that the distribution is symmetrical although it approaches normality 
only when the samples are large. On the other hand, if $\rho_{12} = 0.8$, the 
distribution of $r_{12}$ will be decidedly skewed. Figure 7-3, which is adapted 
from Shewhart [3], illustrates this phenomenon not only for different 
values of $\rho$ but for different sizes of samples. 

[Figure 7-3. Distribution of the correlation coefficient for different values 
of $\rho$ and for different sample sizes. Panels for sample sizes n = 4, n = 10, 
and n = 25; horizontal axis from -1.0 to 1.0.] 

The nature of the distribution of r is important in making tests of 
significance and determining fiducial limits. The usual test of significance 
based on the distribution of $t$ is not affected because our null hypothesis 
is that $\rho_{12}$ is zero in the population sampled. Assuming a population 
for which $\rho_{12}$ is zero, we wish to find the probability of obtaining a value 
of $r_{12}$ equal to or greater than the one observed. We determine 

$$t = \frac{r_{12}\sqrt{n-2}}{\sqrt{1 - r_{12}^2}}$$

which is equivalent to $r_{12}/s_{r_{12}}$, where $s_{r_{12}}$ is the standard error of $r$. We 
should note here that $t = \sqrt{F}$, as determined from the analysis of 
variance. From the tables of $t$ we then find the corresponding value of 
$P$ for $n - 2$ degrees of freedom. This test works successfully because the 
distribution of $t$ as determined from a series of values of $r_{12}$ taken from 
a population in which $\rho_{12} = 0$ behaves in a typical fashion and in accordance 
with theory. 
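
As an illustration of the test, the sketch below computes $t$ from a correlation coefficient and its sample size. The function name `t_for_r` is chosen here for convenience. The inputs are the rounded values quoted with Exercise 1 of this chapter (r = 0.634, 50 pairs), so the result should land close to, but not necessarily exactly on, the answer given there.

```python
import math

def t_for_r(r, n):
    """t statistic for testing rho = 0; it has n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

r, n = 0.634, 50                       # rounded values quoted with Exercise 1
t = t_for_r(r, n)
F = r * r * (n - 2) / (1.0 - r * r)    # F from the analysis of variance

print(round(t, 2), "with", n - 2, "degrees of freedom")
print(abs(t * t - F) < 1e-9)           # t is the square root of F
```

The probability itself still has to be read from a table of $t$ (or from a statistical library, which this sketch deliberately avoids).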

However, if $\rho_{12} = 0.8$, the distribution of $t$ calculated from the samples 
will not be in agreement with the distribution on which the tables are 
based owing to the skewness of the distribution of $r$. The $t$ distribution 
is not satisfactory, therefore, for a test based on 

$$t = \frac{r_{12} - \rho_{12}}{s_{r_{12}}} \tag{16}$$

unless $\rho_{12} = 0$. It follows also that the distribution of $t$ is not suitable 
for setting up fiducial limits for $r_{12}$ unless $\rho_{12}$ is equal to or close to zero. 

In order to overcome this difficulty, R. A. Fisher [1] has developed a 
transformation of r to a statistic z which is approximately normally 
distributed. We determine 

$$z = \tfrac{1}{2}\log_e\left(\frac{1+r}{1-r}\right) \tag{17}$$

and 

$$s_z = \frac{1}{\sqrt{n-3}} \tag{18}$$

Then 

$$t = \frac{z - \zeta}{s_z} \tag{19}$$

where $\zeta$ is the population value of $z$, giving 

$$l_1 = z - ts_z$$

$$l_2 = z + ts_z$$

and, on transforming $l_1$ and $l_2$ back to values of $r$, we have the fiducial 
limits at the 5% point. The value of $t$ in these tests arises from assuming 
the degrees of freedom to be infinite. Accordingly a table of the normal 
deviate will give the same result as the table of $t$. 
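
The transformation and the back-transformation of the limits can be sketched as follows. The values r = 0.60 and n = 28 are hypothetical, the function name `fiducial_limits` is chosen here, and 1.96 is the normal deviate for the 5% point. The inverse hyperbolic tangent is used because it is algebraically identical to $\tfrac{1}{2}\log_e[(1+r)/(1-r)]$.

```python
import math

def fiducial_limits(r, n, t=1.96):
    """Limits for r via the z transformation; t = 1.96 is the 5% normal deviate."""
    z = math.atanh(r)                   # same as 0.5 * log((1 + r) / (1 - r))
    s_z = 1.0 / math.sqrt(n - 3)        # standard error of z
    l1, l2 = z - t * s_z, z + t * s_z   # limits on the z scale
    return math.tanh(l1), math.tanh(l2) # transformed back to the r scale

low, high = fiducial_limits(0.60, 28)   # hypothetical r and n
print(round(low, 3), round(high, 3))
```

The limits are symmetrical about $z$ but not about $r$: the lower limit lies further from $r$ than the upper one, reflecting the skewness described above.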

The $z$ transformation is also useful for testing differences between 
correlation coefficients. Thus 

$$s_{z_1 - z_2} = \sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}} \tag{20}$$

where $n_1$ and $n_2$ are the numbers of pairs of variates in the samples from 
which the two correlations were determined. Then 

$$t = \frac{z_1 - z_2}{s_{z_1 - z_2}} \tag{21}$$

In making a test of this kind it should be remembered that the two values 
of the correlation coefficients must not in themselves be correlated. For 
example, we might calculate the correlation coefficient for insect damage 
and yield in a series of plots, taking portions at random from each plot 
for the determinations. Then, if a similar correlation coefficient is 
obtained at a later date by sampling the same set of plots, the two coeffi- 
cients will be related and the test described here will not apply. However, 
if the correlations are obtained from two entirely different sets of plots, 
the coefficients can be regarded as independent and the test is valid. 
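
For two independent coefficients the test of a difference can be sketched as below. The two correlations and sample sizes are hypothetical, and the function name `z_difference_test` is chosen here for illustration; the result is referred to the normal deviate, as with the fiducial limits.

```python
import math

def z_difference_test(r1, n1, r2, n2):
    """Difference of two independent correlations on the z scale,
    divided by the standard error of the difference."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    s_diff = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / s_diff

# Hypothetical coefficients from two entirely different sets of plots.
t = z_difference_test(0.80, 53, 0.60, 53)
print(round(t, 3))
```

The sketch assumes the two coefficients are uncorrelated; it does not apply to coefficients obtained from the same set of plots.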

When pairs of correlations or linear regressions are in themselves 
correlated and it is necessary to make tests of significance of differences, 
a method has been developed by Yates [5] which is quite satisfactory. 
The reference will be found at the end of this chapter. For the simple 
case where we have one set of values of $X_1$ and two sets of values of the 
independent variates $X_2$ and $X_3$, and $r_{12}$ and $r_{13}$ are to be compared, it is 
satisfactory to determine a new variate $X_4 = (X_2 - X_3)$ and correlate $X_4$ 
with $X_1$. Yates discusses this problem in the paper cited as well as the 
more difficult one where there are two sets of both variables, Х” and X^, 
as well as X, and X;, and it is known that the correlations or regressions 
are correlated. 

6. Calculation of the correlation coefficient. If the student has worked 
through the methods of calculating the regression coefficient given in 
Chapter 6, it will be obvious that these methods can be applied to the 
calculation of the correlation coefficient. For the regression coefficient 
we required merely $\sum xy$ and $\sum x^2$, whereas for the correlation coefficient 
we require in addition $\sum y^2$. In terms of the symbols of this chapter 
we need 

$$\sum x_1x_2 = \sum X_1X_2 - \frac{(\sum X_1)(\sum X_2)}{n}$$

$$\sum x_1^2 = \sum X_1^2 - \frac{(\sum X_1)^2}{n}$$

$$\sum x_2^2 = \sum X_2^2 - \frac{(\sum X_2)^2}{n}$$



These can be determined directly from the paired values if there are about 
50 pairs or less, but for more than 50 it is usually desirable to set up a 
regression table as in Chapter 6. 
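
The direct method amounts to accumulating five totals from the paired values. A minimal sketch follows; the function name `correlation_direct` and the five pairs used to exercise it are hypothetical and are not taken from the tables of this chapter.

```python
import math

def correlation_direct(X1, X2):
    """Correlation coefficient from raw totals, without forming deviations."""
    n = len(X1)
    S1, S2 = sum(X1), sum(X2)
    sp = sum(a * b for a, b in zip(X1, X2)) - S1 * S2 / n   # sum of products
    ss1 = sum(a * a for a in X1) - S1 * S1 / n              # sum of squares, X1
    ss2 = sum(b * b for b in X2) - S2 * S2 / n              # sum of squares, X2
    return sp / math.sqrt(ss1 * ss2)

# Hypothetical pairs: the corrected sums work out to 6, 10 and 6.
r = correlation_direct([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r, 4))
```

The same function could be applied to the columns of Table 7-2 or Table 7-3 once they are typed in as two lists.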


7. Exercises. 


1. The figures in Table 7-2 are physics and English marks* for home economics 
students at the University of Manitoba. Determine the correlation coefficient for 
the relation between the marks in the two subjects. Calculate by the direct method, 
and test the significance of the coefficient. r= + 0.634. t= 5.67. 


TABLE 7-2 


MARKS IN PHYSICS AND ENGLISH OF 50 STUDENTS IN HOME ECONOMICS 
AT THE UNIVERSITY OF MANITOBA 


Student Physics English Student Physics English Student Physics English 


1 20 21 18 26 29 35 23 26 
2 25 26 19 24 27 36 26 27 
3 24 27 20 19 26 37 22 21 
4 22 24 21 25 25 38 26 25 
5 27 27 22 18 20 39 23 21 
6 26 28 23 20 24 40 29 26 
7 26 24 24 23 24 41 23 19 
8 32 26 25 22 20 42 20 19 
9 22 24 26 23 26 43 23 30 
10 27 26 27 31 27 44 33 32 
11 22 23 28 24 25 45 21 19 
12 22 25 29 28 30 46 28 30 
13 22 24 30 25 28 47 24 21 
14 29 30 31 26 32 48 26 28 
15 26 30 32 24 25 49 24 22 
16 24 28 33 27 30 50 30 25 
17 25 29 34 28 30 


* By courtesy of the Registrar, University of Manitoba. 

2. Determine the correlation coefficient for days to head and days to mature of 
25 wheat varieties, for the data of Table 7-3. Find the fiducial limits at the 5% point 
for this coefficient. r = + 0.946. $l_1$ = 0.916. $l_2$ = 0.972. 


TABLE 7-3 


DAYS TO HEAD AND DAYS TO MATURE OF 25 VARIETIES OF WHEAT 


            Days To   Days To               Days To   Days To 
Variety     Head      Mature     Variety    Head      Mature 
   1        60.0      94.4         14       58.2      92.4 
   2        53.6      89.0         15       58.0      91.6 
   3        59.0      94.0         16       59.4      94.0 
   4        61.8      95.4         17       55.4      90.8 
   5        53.8      88.2         18       61.6      95.2 
   6        57.8      93.4         19       63.0      97.2 
   7        57.8      93.6         20       60.2      94.6 
   8        58.4      92.0         21       61.6      96.0 
   9        57.8      92.8         22       57.6      92.6 
  10        59.0      93.4         23       60.8      95.4 
  11        59.2      93.8         24       61.2      94.4 
  12        59.0      92.8         25       58.2      94.0 
  13        58.6      94.2 


CHAPTER 8 


Partial and Multiple Regression 
and Correlation 


1. The partial and multiple regression problem. In many regression 
problems the investigator is concerned purely with the effect of one variable 
on another. Suppose that a new method has been developed for deter- 
mining the protein content of grain and that this method is to be compared 
with an older and thoroughly tested method known to give very accurate 
results. The two methods are tried out on a large series of samples, and 
the regression is determined for the results by the old method on those 
for the new method. The object here is to see how closely the protein 
content of a sample can be predicted from results obtained by the new 
method. It does not matter in this example whether the new method 
actually measures protein or some other characteristic closely associated 
with protein. A high degree of accuracy in prediction is all that is required. 

In examples of a different nature we may be intimately concerned with 
the effect of more than one independent variable on the dependent 
variable, and total relations as given by simple regression coefficients may 
not give satisfactory information. Suppose that we find that the regression 
of yield of wheat on rainfall is represented by an increase of 1 bushel for 
every tenth of an inch of rainfall. Can we conclude from this that a 
given increase in rainfall will always cause a given increase in yield? 
The answer is no, because yield may be influenced by another factor, such 
as temperature, which may be different for another set of data obtained 
under different conditions. Also there may be a significant regression of 
rainfall on temperature, and it is difficult to decide whether the yield 
increases observed are due to one factor or to a combination of both 
factors. 

What we require in examples of this sort is a measure of the relation 
between two variables when the other variable is constant. This is the 
purpose of partial regression and correlation methods. 

2. Partial regression methods. As simple or total regression is based 
on the derivation of a regression equation of the form 


$$Y_e = a + bX$$

so partial regression is based on a regression equation of the form 

$$Y_e = a + b_1X_1 + b_2X_2 + \cdots + b_pX_p \tag{1}$$

where there are $p$ independent variables and the regression coefficients 
$b_1, b_2, \cdots, b_p$ are referred to as partial regression coefficients. The 
equation is actually an equation of multiple regression as it represents a 
method of predicting values of $Y$ from individual values of the $p$ variables 
with which we are concerned. 

Changing to deviations from the mean, equation (1) becomes 

$$y_e = b_1x_1 + b_2x_2 + \cdots + b_px_p \tag{2}$$

and the error in estimating a single value of $Y$ will be 

$$y - b_1x_1 - b_2x_2 - \cdots - b_px_p \tag{3}$$

Following the same procedure as for simple regression, our problem is 
to find values of $b_1, b_2, \cdots, b_p$ that will make the sum of the squares of 
the errors of estimation a minimum. Thus it is required to minimize 

$$\sum (y - b_1x_1 - b_2x_2 - \cdots - b_px_p)^2 \tag{4}$$

and this leads by the method of least squares to $p$ simultaneous equations, 
which are given below for 3 independent variables, and can be extended 
easily for the number of independent variables required. 

$$\begin{aligned}
b_1\sum x_1^2 + b_2\sum x_1x_2 + b_3\sum x_1x_3 &= \sum x_1y \\
b_1\sum x_1x_2 + b_2\sum x_2^2 + b_3\sum x_2x_3 &= \sum x_2y \\
b_1\sum x_1x_3 + b_2\sum x_2x_3 + b_3\sum x_3^2 &= \sum x_3y
\end{aligned} \tag{5}$$

The solution of these equations gives the values of $b_1$, $b_2$, $b_3$ to be inserted 
in the prediction equation 

$$y_e = b_1x_1 + b_2x_2 + b_3x_3$$

It should be noted that the partial regression coefficients are represented 
in the equation by an abbreviated notation. Thus to write $b_1$ in full we 
should write $b_{y1.23}$, showing that this is a partial regression coefficient 
for $Y$ on $X_1$, with the variables $X_2$ and $X_3$ held constant. 

3. Meaning of partial regression. Although we have said that the 
regression coefficient $b_{y1.23}$ represents the regression of $Y$ on $X_1$ with $X_2$ 
and $X_3$ held constant, it is possible to put this relation in a more direct 
form. Suppose that we have 2 independent variables $X_1$ and $X_2$ and 
wish to obtain the partial regression coefficient $b_{y1.2}$. We can determine 
the simple regression equations 

$$y_e = b_{y2}x_2 \quad \text{and} \quad x_{1e} = b_{12}x_2$$

From the first equation we can obtain a series of estimated values of $Y$. 
Then from the second equation we can get a series of estimated values of 
$X_1$. The deviations of the actual values of $Y$ and $X_1$ from the respective 
regression lines represented by 

$$u = y - b_{y2}x_2 \quad \text{and} \quad v = x_1 - b_{12}x_2$$

are both independent of $X_2$; hence the required regression coefficient 
will be $b_{uv}$. This can be taken as the starting point in the derivation of 
partial regression methods. It leads by substitution and algebraic 
manipulation to the values of $b$ obtained by solving the normal equations. 
Thus $b_{uv} = b_{y1.2}$. This procedure can be built up for the derivation of 
regression equations with $p$ independent variables. 
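
The equality of the regression of $u$ on $v$ and the partial regression coefficient can be checked numerically. The sketch below uses six hypothetical observations (not from the text), with helper names chosen here. It computes the coefficient once from the two sets of deviations from regression and once by solving the two normal equations directly with determinants.

```python
def dev(values):
    """Deviations of a list of values from their mean."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def sp(a, b):
    """Sum of products of two lists of deviations."""
    return sum(p * q for p, q in zip(a, b))

# Hypothetical observations (not from the text).
Y = [3.0, 4.0, 8.0, 8.0, 13.0, 12.0]
X1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y, x1, x2 = dev(Y), dev(X1), dev(X2)

# Simple regressions of y and x1 on x2, and the deviations from them.
b_y2 = sp(y, x2) / sp(x2, x2)
b_12 = sp(x1, x2) / sp(x2, x2)
u = [a - b_y2 * c for a, c in zip(y, x2)]
v = [a - b_12 * c for a, c in zip(x1, x2)]
b_uv = sp(u, v) / sp(v, v)               # regression of u on v

# The same coefficient from the two normal equations (solved by determinants).
s11, s22, s12 = sp(x1, x1), sp(x2, x2), sp(x1, x2)
g1, g2 = sp(x1, y), sp(x2, y)
b_y1_2 = (g1 * s22 - g2 * s12) / (s11 * s22 - s12 * s12)

print(round(b_uv, 6), round(b_y1_2, 6))
```

The two printed values agree, which is the statement that the regression of $u$ on $v$ is the partial regression coefficient.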

4. Calculation of partial regression coefficients. It is obvious that 
equations (5) can be solved by a direct application of elementary methods 
involving successive elimination of the unknowns. These can be found 
in any good text on algebra. A tabular method for the solution of the 
normal equations, known as the Doolittle method, is very convenient, 
and more recently Dwyer [2] has summarized and related all available 
methods and has presented what is referred to as the abbreviated Doolittle 
solution. Since this method reduces time and labor to the minimum, we 
shall use it here. The work is given in Example 8-1.* 

5. Example 8-1. Calculation of partial regression coefficients. The 
following data were obtained in a study of the relation between quality 
characteristics of wheat flour. Data were taken on 139 strains of wheat. 
The characteristics studied were loaf volume ($Y$), protein content ($X_1$), 
flour yield ($X_2$), and crumb color score ($X_3$). In this example loaf volume 
is taken as the dependent variable. This is not quite satisfactory in 
connection with crumb color, as actually it is variations in the loaf volume 
that influence color and bring about the correlation, and not the variations 
in color that affect the volume. 


$\sum(x_1y) = 694.96$     $\sum(y^2) = 1142.45$     $\sum(x_1x_2) = -20.834$ 
$\sum(x_2y) = 79.648$     $\sum(x_1^2) = 947.17$     $\sum(x_1x_3) = 272.76$ 
$\sum(x_3y) = 407.00$     $\sum(x_2^2) = 414.23$     $\sum(x_2x_3) = -354.00$ 
                          $\sum(x_3^2) = 630.00$ 


* Crout [1] has demonstrated a method that appears to give maximum abbreviation. 
It is demonstrated in the Appendix. 


TABLE 8-1 


CALCULATION OF PARTIAL REGRESSION COEFFICIENTS 


Line    $X_1$       $X_2$         $X_3$        $Y$         Symbols 
  1     947.17     -20.834      272.76      694.96      $a_{11}$   $a_{12}$   $a_{13}$   $a_{1y}$ 
  2                414.23      -354.00      79.648                 $a_{22}$   $a_{23}$   $a_{2y}$ 
  3                             630.00      407.00                            $a_{33}$   $a_{3y}$ 
  4     947.17     -20.834      272.76      694.96      $a_{11}$   $a_{12}$   $a_{13}$   $a_{1y}$ 
  5     1.0        -0.021996    0.28797     0.73372     1.0        $b_{12}$   $b_{13}$   $b_{1y}$ 
  6                413.77      -348.00      94.934                 $a_{22.1}$   $a_{23.1}$   $a_{2y.1}$ 
  7                1.0         -0.84105     0.22944                1.0        $b_{23.1}$   $b_{2y.1}$ 
  8                             258.77      286.72                            $a_{33.12}$   $a_{3y.12}$ 
  9                             1.0         1.1080                            1.0        $b_{y3.12}$ 
 10     0.44019    1.1613       1.1080                  $b_{y1.23}$   $b_{y2.13}$   $b_{y3.12}$ 


Check: 

(947.17 × 0.440 19) + (- 20.834 × 1.1613) + (272.76 × 1.1080) = 694.96 


Line        Instructions 
1 to 3      Enter sums of squares and products. 
4           Copy line 1. 
5           Divide each entry in line 4 by $a_{11}$. 
6           $a_{22.1} = a_{22} - a_{12}b_{12}$,  $a_{23.1} = a_{23} - a_{12}b_{13}$,  $a_{2y.1} = a_{2y} - a_{12}b_{1y}$ 
7           Divide each entry in line 6 by $a_{22.1}$. 
8           $a_{33.12} = a_{33} - a_{13}b_{13} - a_{23.1}b_{23.1}$,  $a_{3y.12} = a_{3y} - a_{13}b_{1y} - a_{23.1}b_{2y.1}$ 
9           Divide each entry in line 8 by $a_{33.12}$. 
10          Reverse solution. 
            $b_{y3.12}$ copied from line 9 
            $b_{y2.13} = b_{2y.1} - b_{23.1}b_{y3.12}$ 
            $b_{y1.23} = b_{1y} - b_{12}b_{y2.13} - b_{13}b_{y3.12}$ 

Check: 
            $a_{11}b_{y1.23} + a_{12}b_{y2.13} + a_{13}b_{y3.12} = a_{1y}$ 


The sum of squares for regression is given by 

$$b_1\sum x_1y + b_2\sum x_2y + b_3\sum x_3y$$

in which $b_{y1.23}$ is abbreviated to $b_1$, etc. The figures are 

(0.440 19 × 694.96) + (1.1613 × 79.648) + (1.1080 × 407.00) 

= 305.91 + 92.50 + 450.96 = 849.37 

The normal equations corresponding to equations (5) are 

$$\begin{aligned}
b_1(947.17) + b_2(-20.834) + b_3(272.76) &= 694.96 \\
b_1(-20.834) + b_2(414.23) + b_3(-354.00) &= 79.648 \\
b_1(272.76) + b_2(-354.00) + b_3(630.00) &= 407.00
\end{aligned}$$

To solve these equations we shall set them up as in Table 8-1. The 
portion of the table on the right represents the known values in terms of 
symbols in which the notation is in accordance with Dwyer [2]. Below 
the table are line-by-line instructions for carrying through the calculations. 
The calculations in Table 8-1 enable us to set up an analysis of variance 
for the multiple regression. 


                               SS         DF                     MS 
Regression                   849.37        3    (p)            283.12 
Deviations from regression   293.08      135    (n - p - 1)      2.1710 
Total                       1142.45      138    (n - 1) 


Since the value of F will be very large, there is no object in going farther 
with this test. It will be more informative to study the significance and 
possibly the fiducial limits of the partial regression coefficients. The 
method is given in Section 8. 

It should be noted that the multiple correlation coefficient is given by 


$$R_{y.123} = \sqrt{\frac{849.37}{1142.45}} = 0.8622$$


and its test of significance is given by the analysis of variance above. 
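
The arithmetic of Example 8-1 can be reproduced with a short program. The sketch below is not the abbreviated Doolittle layout itself; it uses ordinary Gaussian elimination with back substitution, which is the same elimination idea written as a loop. The sums of squares and products, the total sum of squares, and n = 139 are those of the example; the function and variable names are chosen here for illustration.

```python
import math

# Sums of squares and products of the independent variables (Example 8-1).
A = [[947.17, -20.834, 272.76],
     [-20.834, 414.23, -354.00],
     [272.76, -354.00, 630.00]]
g = [694.96, 79.648, 407.00]        # sums of products with y
ss_total, n, p = 1142.45, 139, 3    # total sum of squares, strains, variables

def solve(a, rhs):
    """Gaussian elimination and back substitution without pivoting
    (adequate here because the matrix is symmetric and positive definite)."""
    k = len(a)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for i in range(k):
        for j in range(i + 1, k):
            f = m[j][i] / m[i][i]
            for c in range(i, k + 1):
                m[j][c] -= f * m[i][c]
    b = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(m[i][c] * b[c] for c in range(i + 1, k))
        b[i] = (m[i][k] - tail) / m[i][i]
    return b

b = solve(A, g)                                  # partial regression coefficients
ss_reg = sum(bi * gi for bi, gi in zip(b, g))    # sum of squares for regression
ss_dev = ss_total - ss_reg                       # deviations from regression
ms_dev = ss_dev / (n - p - 1)
F = (ss_reg / p) / ms_dev
R = math.sqrt(ss_reg / ss_total)                 # multiple correlation coefficient

print([round(v, 4) for v in b], round(ss_reg, 2), round(ms_dev, 4), round(R, 4))
```

The printed coefficients, sum of squares for regression, mean square for deviations, and multiple correlation coefficient should agree with the worked values to within the rounding carried in the tables.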

6. Calculation of partial regression coefficients for different sets of values 
of the dependent variable. It happens frequently in a partial regression 
analysis that the investigator has more than one set of values of the 
dependent variable. In an example given by Fisher [3] of the effect of 
longitude, latitude, and altitude on yield, these would remain the same 
for a given set of locations, but the yield would change each year. It is 
desirable in such circumstances to be able to calculate sets of partial 
regression coefficients with minimum labor. Fisher [3] has shown how 
this can be done by the calculation from the dependent variables of a set 
of statistics commonly known as the Gauss multipliers. There are other 
important applications of the Gauss multipliers that we shall refer to later. 

The proposition is illustrated by setting up the normal equations in the 
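form given below, for 3 independent variables. The student can very 
easily take these as a basis for writing the equations for more than 3 
independent variables. 

$$\begin{aligned}
c_{11}\sum x_1^2 + c_{12}\sum x_1x_2 + c_{13}\sum x_1x_3 &= 1 \\
c_{11}\sum x_1x_2 + c_{12}\sum x_2^2 + c_{13}\sum x_2x_3 &= 0 \\
c_{11}\sum x_1x_3 + c_{12}\sum x_2x_3 + c_{13}\sum x_3^2 &= 0
\end{aligned} \tag{6}$$

$$\begin{aligned}
c_{21}\sum x_1^2 + c_{22}\sum x_1x_2 + c_{23}\sum x_1x_3 &= 0 \\
c_{21}\sum x_1x_2 + c_{22}\sum x_2^2 + c_{23}\sum x_2x_3 &= 1 \\
c_{21}\sum x_1x_3 + c_{22}\sum x_2x_3 + c_{23}\sum x_3^2 &= 0
\end{aligned} \tag{7}$$

$$\begin{aligned}
c_{31}\sum x_1^2 + c_{32}\sum x_1x_2 + c_{33}\sum x_1x_3 &= 0 \\
c_{31}\sum x_1x_2 + c_{32}\sum x_2^2 + c_{33}\sum x_2x_3 &= 0 \\
c_{31}\sum x_1x_3 + c_{32}\sum x_2x_3 + c_{33}\sum x_3^2 &= 1
\end{aligned} \tag{8}$$

The solution of these equations gives the matrix of multipliers: 

$$\begin{matrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{matrix} \qquad \text{where } c_{12} = c_{21}, \text{ etc.}$$

and the regression coefficients can then be obtained from 

$$\begin{aligned}
b_1 &= c_{11}\sum x_1y + c_{12}\sum x_2y + c_{13}\sum x_3y \\
b_2 &= c_{21}\sum x_1y + c_{22}\sum x_2y + c_{23}\sum x_3y \\
b_3 &= c_{31}\sum x_1y + c_{32}\sum x_2y + c_{33}\sum x_3y
\end{aligned} \tag{9}$$

This shows that for any one set of values of $Y$ we have to obtain merely 
$\sum x_1y$, $\sum x_2y$, and $\sum x_3y$. 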

This method is applied in Example 8-2 to obtain the same set of 
regression coefficients as in Example 8-1. 
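
Equations (6), (7), and (8) say that the matrix of multipliers is the inverse of the matrix of sums of squares and products. A minimal sketch of this, using Gauss-Jordan elimination in place of the tabular layout, is given below. The numerical values are those of Example 8-1; the function name `invert` and the second set of sums of products (`g_other`) are assumptions made here to illustrate the reuse of the multipliers.

```python
# Sums of squares and products of the independent variables (Example 8-1).
A = [[947.17, -20.834, 272.76],
     [-20.834, 414.23, -354.00],
     [272.76, -354.00, 630.00]]

def invert(a):
    """Gauss-Jordan inversion without pivoting (the matrix is positive definite)."""
    k = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(k)]
         for i, row in enumerate(a)]
    for i in range(k):
        piv = m[i][i]
        m[i] = [v / piv for v in m[i]]
        for j in range(k):
            if j != i:
                f = m[j][i]
                m[j] = [vj - f * vi for vj, vi in zip(m[j], m[i])]
    return [row[k:] for row in m]

def coefficients(c, g):
    """Equations (9): each b is a row of multipliers times the sums of products."""
    return [sum(cij * gj for cij, gj in zip(row, g)) for row in c]

C = invert(A)                                  # the matrix of multipliers
b = coefficients(C, [694.96, 79.648, 407.00])  # sums of products from Example 8-1

# A further set of values of the dependent variable needs only new sums of
# products; the figures below are hypothetical.
g_other = [500.0, 60.0, 300.0]
b_other = coefficients(C, g_other)

print([round(v, 7) for v in C[0]], [round(v, 4) for v in b])
```

The multipliers are computed once; each additional dependent variable costs only the three multiplications and additions per coefficient shown in equations (9).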

7. Example 8-2. Calculation of partial regression coefficients using the 
Gauss multipliers. The three sets of normal equations given in (6), (7), 
and (8) can be solved simultaneously in tabular form with the abbreviated 
Doolittle technique. The procedure is shown in Table 8-2. The same 
rules of calculation apply as in Table 8-1, but a further table of symbols 
with instructions is given for convenient reference. 

TABLE 8-2 


CALCULATION OF GAUSS MULTIPLIERS—ABBREVIATED DOOLITTLE SOLUTION 


Line    $X_1$           $X_2$               $X_3$ 
  1     947.17         -20.834            272.76           1.0              0               0 
  2                    414.23            -354.00           0                1.0             0 
  3                                       630.00           0                0               1.0 
  4     947.17         -20.834            272.76           1.0              0               0 
  5     1.0            -0.021 996         0.287 97         0.001 055 8      0               0 
  6                    413.77            -348.00           0.021 996        1.0             0 
  7                    1.0               -0.841 05         0.000 053 160    0.002 416 8     0 
  8                                       258.77          -0.269 47         0.841 05        1.0 
  9                                       1.0             -0.001 041 3      0.003 250 2     0.003 864 4 
 10     0.001 337 6    -0.000 822 67     -0.001 041 3 
 11                    0.005 150 4        0.003 250 2 
 12                                       0.003 864 4 

Check: 

(0.001 337 6 × 947.17) + (- 0.000 822 67 × - 20.834) + (- 0.001 041 3 × 272.76) = 1.0000 


Regression coefficients can now be obtained using equations (9). 


$b_1$ = (0.001 337 6 × 694.96) + (- 0.000 822 67 × 79.648) + 
(- 0.001 041 3 × 407.00) = 0.4402 


$b_2$ = (- 0.000 822 67 × 694.96) + (0.005 150 4 × 79.648) + 
(0.003 250 2 × 407.00) = 1.1613 


$b_3$ = (- 0.001 041 3 × 694.96) + (0.003 250 2 × 79.648) + 
(0.003 864 4 × 407.00) = 1.1080 


The significance of the regression coefficients can be determined by the 
t test, using the formula for the standard error given in equation (10), 
Section 8. From the analysis of variance, Example 8-1, we have 
$s^2 = 2.1710$; thus $s = 1.473$. 


$s_{b_1} = 1.473\sqrt{0.001\,337\,6} = 0.0539$ 
$s_{b_2} = 1.473\sqrt{0.005\,150\,4} = 0.1057$ 
$s_{b_3} = 1.473\sqrt{0.003\,864\,4} = 0.0916$ 

TABLE 8-3 


CALCULATION OF GAUSS MULTIPLIERS—INSTRUCTIONS FOR USE OF 
ABBREVIATED DOOLITTLE SOLUTION 


Line 
  1     $a_{11}$    $a_{12}$      $a_{13}$       1.0           0             0 
  2                 $a_{22}$      $a_{23}$       0             1.0           0 
  3                               $a_{33}$       0             0             1.0 
  4     $a_{11}$    $a_{12}$      $a_{13}$       $d_{11}$      0             0 
  5     1.0         $b_{12}$      $b_{13}$       $e_{11}$      0             0 
  6                 $a_{22.1}$    $a_{23.1}$     $d_{21.1}$    $d_{22.1}$    0 
  7                 1.0           $b_{23.1}$     $e_{21.1}$    $e_{22.1}$    0 
  8                               $a_{33.12}$    $d_{31.12}$   $d_{32.12}$   $d_{33.12}$ 
  9                               1.0            $e_{31.12}$   $e_{32.12}$   $e_{33.12}$ 
 10     $c_{11}$    $c_{12}$      $c_{13}$ 
 11                 $c_{22}$      $c_{23}$ 
 12                               $c_{33}$ 

Line          Instructions 
1, 2, 3       Enter sums of squares and products. 
4             Copy line 1. 
5             Divide each entry in line 4 by $a_{11}$. 
6             $a_{22.1} = a_{22} - a_{12}b_{12}$,  $a_{23.1} = a_{23} - a_{12}b_{13}$,  $d_{21.1} = 0 - a_{12}e_{11}$ 
7             Divide each entry in line 6 by $a_{22.1}$. 
8             $a_{33.12} = a_{33} - a_{13}b_{13} - a_{23.1}b_{23.1}$,  $d_{31.12} = 0 - a_{13}e_{11} - a_{23.1}e_{21.1}$, 
              etc., for $d_{32.12}$ and $d_{33.12}$. 
9             Divide each entry in line 8 by $a_{33.12}$. 
10, 11, 12    $c_{11} = d_{11}e_{11} + d_{21.1}e_{21.1} + d_{31.12}e_{31.12}$ 
              $c_{12} = d_{21.1}e_{22.1} + d_{31.12}e_{32.12}$ 
              $c_{13} = d_{31.12}e_{33.12}$ 
              $c_{22} = d_{22.1}e_{22.1} + d_{32.12}e_{32.12}$ 
              $c_{23} = d_{32.12}e_{33.12}$ 
              $c_{33} = d_{33.12}e_{33.12}$ 




8. Testing the significance of partial regression coefficients. The Gauss 
multipliers provide an easy method for determining the standard errors 
of partial regression coefficients. These are given by 

$$\begin{aligned}
s_{b_1} &= s\sqrt{c_{11}} \\
s_{b_2} &= s\sqrt{c_{22}} \\
s_{b_3} &= s\sqrt{c_{33}}
\end{aligned} \tag{10}$$

where 

$$s^2 = \frac{\sum (Y - Y_e)^2}{n - p - 1} \tag{11}$$

$p$ is the number of independent variables, and $\sum (Y - Y_e)^2$ is most easily 
obtained from 

$$\sum (Y - Y_e)^2 = \sum y^2 - b_1\sum x_1y - b_2\sum x_2y - \cdots - b_p\sum x_py \tag{12}$$

The standard error of a difference between two regression coefficients 
can also be expressed in terms of the multipliers. Thus, for the difference 
between $b_{y1.2 \cdots p}$ and $b_{y2.1 \cdots p}$ the standard error is 

$$s\sqrt{c_{11} + c_{22} - 2c_{12}} \tag{13}$$

A further application is in determining the standard error of an 
estimated value of $Y$ such as $Y_e$. For 3 independent variables, the 
standard error of $Y_e$ is given by 

$$SE = s\sqrt{\frac{1}{n} + c_{11}x_1^2 + c_{22}x_2^2 + c_{33}x_3^2 + 2c_{12}x_1x_2 + 2c_{13}x_1x_3 + 2c_{23}x_2x_3} \tag{14}$$

where $x_1$, $x_2$, and $x_3$ are the deviations from the means of the particular 
values for which $Y$ is estimated. Again, as in Chapter 6, it should be 
pointed out that the estimated $Y$ is an estimate of the population value of 
$Y$ for a given set of values of $X_1$, $X_2$, and $X_3$. 
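
The three uses of the multipliers described in this section can be sketched together. The multipliers, the coefficients, and the mean square are the rounded values printed in Examples 8-1 and 8-2; the dictionary layout, the function name `se_estimate`, and the deviations passed to it are assumptions made here for illustration.

```python
import math

s = math.sqrt(2.1710)        # square root of the mean square for deviations
n = 139                      # number of strains in Example 8-1

# Gauss multipliers from Table 8-2 (symmetric, so only the upper half is stored).
c = {(1, 1): 0.0013376, (1, 2): -0.00082267, (1, 3): -0.0010413,
     (2, 2): 0.0051504, (2, 3): 0.0032502, (3, 3): 0.0038644}
b = {1: 0.44019, 2: 1.1613, 3: 1.1080}

# Standard errors of the partial regression coefficients and their t values.
se_b = {i: s * math.sqrt(c[(i, i)]) for i in (1, 2, 3)}
t_b = {i: b[i] / se_b[i] for i in (1, 2, 3)}

# Standard error of the difference between the first two coefficients.
se_diff_12 = s * math.sqrt(c[(1, 1)] + c[(2, 2)] - 2 * c[(1, 2)])

def se_estimate(x):
    """Standard error of an estimated Y; x holds the deviations (x1, x2, x3)
    of the chosen values from their means."""
    q = 1.0 / n
    for i in (1, 2, 3):
        for j in (1, 2, 3):
            # the double loop yields c_ii * x_i^2 and 2 * c_ij * x_i * x_j
            q += c[(min(i, j), max(i, j))] * x[i - 1] * x[j - 1]
    return s * math.sqrt(q)

print({i: round(v, 4) for i, v in se_b.items()})
print(round(se_diff_12, 4), round(se_estimate((1.0, -0.5, 2.0)), 4))
```

Each t value is referred to the table of $t$ with $n - p - 1 = 135$ degrees of freedom.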

9. Deletion of a variable. After completing a regression or correlation 
analysis, we may inquire as to the results that would be obtained if a 
certain variable had been omitted. An easy technique is available if the 
Gauss multipliers have been obtained. It involves the calculation of a 
new matrix according to the formulas given below. These formulas are 
a little cumbersome to express in general terms and are therefore given 
for specific cases that can be adapted to the general case. $rc_{ij}$ represents 
a value in the reduced matrix. 

Variables $Y$, $X_1$, $X_2$, $X_3$: to delete $X_3$, 

$$rc_{11} = c_{11} - \frac{c_{13}^2}{c_{33}} \qquad rc_{12} = c_{12} - \frac{c_{13}c_{23}}{c_{33}} \qquad rc_{22} = c_{22} - \frac{c_{23}^2}{c_{33}} \tag{15}$$

Variables $Y$, $X_1$, $X_2$, $X_3$: to delete $X_2$, 

$$rc_{11} = c_{11} - \frac{c_{12}^2}{c_{22}} \qquad rc_{13} = c_{13} - \frac{c_{12}c_{23}}{c_{22}} \qquad rc_{33} = c_{33} - \frac{c_{23}^2}{c_{22}} \tag{16}$$

The calculations required to delete $X_2$ from Example 8-2 are given in 
Example 8-3. The same formulas can be applied to a matrix of Gauss 
multipliers arising from a correlation analysis. 
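
Formulas (15) and (16) are two cases of one rule: every remaining element is reduced by the product of its two neighbours in the deleted row and column, divided by the diagonal element of the deleted variable. The sketch below writes that rule as a function and applies it to the multipliers of Table 8-2 with $X_2$ deleted. The function name `delete_variable` and the zero-based indexing are choices made here for illustration.

```python
# Gauss multipliers from Table 8-2, written out as a full symmetric matrix.
c = [[0.0013376, -0.00082267, -0.0010413],
     [-0.00082267, 0.0051504, 0.0032502],
     [-0.0010413, 0.0032502, 0.0038644]]

def delete_variable(c, k):
    """Reduced matrix of multipliers after deleting variable k (0-based index)."""
    keep = [i for i in range(len(c)) if i != k]
    return [[c[i][j] - c[i][k] * c[k][j] / c[k][k] for j in keep] for i in keep]

rc = delete_variable(c, 1)              # delete X2; rows and columns are X1, X3

# Regression coefficients for Y on X1 and X3 only.
g = [694.96, 407.00]                    # sums of products of x1 and x3 with y
b1 = rc[0][0] * g[0] + rc[0][1] * g[1]
b3 = rc[1][0] * g[0] + rc[1][1] * g[1]

print([[round(v, 8) for v in row] for row in rc], round(b1, 4), round(b3, 4))
```

Calling `delete_variable(c, 2)` instead gives the reduced matrix of formula (15), with $X_3$ deleted.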

10. Example 8-3. Deletion of $X_2$ from regression analysis for $Y$, $X_1$, 
$X_2$, and $X_3$. 

$$rc_{11} = 0.001\,337\,6 - \frac{(-0.000\,822\,67)^2}{0.005\,150\,4} = 0.001\,206\,2$$

$$rc_{13} = -0.001\,041\,3 - \frac{(-0.000\,822\,67 \times 0.003\,250\,2)}{0.005\,150\,4} = -0.000\,522\,15$$

$$rc_{33} = 0.003\,864\,4 - \frac{(0.003\,250\,2)^2}{0.005\,150\,4} = 0.001\,813\,3$$

Check: 
(0.001 206 2 × 947.17) + (- 0.000 522 15 × 272.76) = 1.000 05 

$b_1 = rc_{11}\sum x_1y + rc_{13}\sum x_3y$ 
= (0.001 206 2 × 694.96) + (- 0.000 522 15 × 407.00) = 0.625 74 

$b_3 = rc_{13}\sum x_1y + rc_{33}\sum x_3y$ 
= (- 0.000 522 15 × 694.96) + (0.001 813 3 × 407.00) = 0.375 14 

Check: 
(0.625 74 × 947.17) + (0.375 14 × 272.76) = 695.00 

SS regression = $b_1\sum x_1y + b_3\sum x_3y$ 
= (0.625 74 × 694.96) + (0.375 14 × 407.00) = 587.55 

$$R_{y.13} = \sqrt{\frac{587.55}{1142.45}} = 0.7171, \text{ etc.}$$

11. Partial and multiple correlation. Referring to Section 3 above, we 
note that for 2 independent variables, $X_1$ and $X_2$, the simple regression 
equations 

$$y_e = b_{y2}x_2 \quad \text{and} \quad x_{1e} = b_{12}x_2$$

provide a series of estimated values of $y$ for each value of $x_2$ and a similar 
series of estimated values of $x_1$. The deviations of the actual values of 
$y$ and $x_1$ from the respective regression lines are independent of $x_2$. 
Representing these by 

$$u = y - b_{y2}x_2 \quad \text{and} \quad v = x_1 - b_{12}x_2$$

$r_{uv}$ is a partial correlation coefficient and can be represented symbolically 
by $r_{y1.2}$. 

In general, in correlation analysis we are not so much concerned with 
the distinction between dependent and independent variables as we are 
in regression analysis; therefore we shall refer to the variates as $X_1$, $X_2$, $X_3$,
etc., where the relation with regression problems is such that $X_1$ is taken
as the variable corresponding to $Y$. For the normal equations we have
therefore, with 4 variables and 3 unknowns,

$$
\begin{aligned}
b'_{12\cdot34} + r_{23}\,b'_{13\cdot24} + r_{24}\,b'_{14\cdot23} &= r_{12} \\
r_{23}\,b'_{12\cdot34} + b'_{13\cdot24} + r_{34}\,b'_{14\cdot23} &= r_{13} \\
r_{24}\,b'_{12\cdot34} + r_{34}\,b'_{13\cdot24} + b'_{14\cdot23} &= r_{14}
\end{aligned}
\tag{17}
$$

where in the subscripts, $1 = X_1$, $2 = X_2$, $3 = X_3$, $4 = X_4$, or in
correspondence with regression analysis, $1 = Y$, $2 = X_1$, $3 = X_2$, $4 = X_3$.

These equations are derived from (5). The regression coefficients are
distinguished by a prime because they are actually standard regression
coefficients. In other words, they are made standard with respect to the
standard deviations of the variables involved. Thus

$$b'_{12\cdot34} = b_{12\cdot34}\sqrt{\frac{\Sigma x_2^2}{\Sigma x_1^2}}, \qquad b'_{13\cdot24} = b_{13\cdot24}\sqrt{\frac{\Sigma x_3^2}{\Sigma x_1^2}} \tag{18}$$

The methods of calculating partial correlation coefficients are similar to
those for calculating regression coefficients. The basic procedure is to
solve the normal equations for the standard regression coefficients and
then obtain the partial correlation coefficients from the relation

$$r_{12\cdot34} = \sqrt{b'_{12\cdot34}\,b'_{21\cdot34}} \tag{19}$$

From the normal equations in (17) it is obvious that the solution of
one set of equations giving $b'_{12\cdot34}$, $b'_{13\cdot24}$, and $b'_{14\cdot23}$ will not enable us to
calculate any correlation coefficients, as to give us $r_{12\cdot34}$, $r_{13\cdot24}$, and $r_{14\cdot23}$
we require in addition $b'_{21\cdot34}$, $b'_{31\cdot24}$, and $b'_{41\cdot23}$. These can be obtained
by rearranging the equations and solving again, continuing with such
rearrangements until all possible values of the standard regression
coefficients have been obtained, but the whole problem is greatly simplified
by the calculation of a matrix of multipliers which is the inverse of the
$r$ matrix. Thus the calculations with the $r$ matrix on the left below give
the $c$ matrix as shown on the right.

$$
\begin{array}{cccc}
r_{11} & r_{12} & r_{13} & r_{14} \\
r_{12} & r_{22} & r_{23} & r_{24} \\
r_{13} & r_{23} & r_{33} & r_{34} \\
r_{14} & r_{24} & r_{34} & r_{44}
\end{array}
\qquad
\begin{array}{cccc}
c_{11} & c_{12} & c_{13} & c_{14} \\
c_{12} & c_{22} & c_{23} & c_{24} \\
c_{13} & c_{23} & c_{33} & c_{34} \\
c_{14} & c_{24} & c_{34} & c_{44}
\end{array}
\tag{20}
$$


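From the $c$ matrix any of the partial regression or correlation coefficients
can be calculated. Thus, it can be shown that

$$b'_{12\cdot34} = -\frac{c_{12}}{c_{11}}, \qquad b'_{13\cdot24} = -\frac{c_{13}}{c_{11}}, \quad\text{etc.} \tag{21}$$

from which it follows that

$$r_{12\cdot34} = \frac{-c_{12}}{\sqrt{c_{11}c_{22}}}, \qquad r_{23\cdot14} = \frac{-c_{23}}{\sqrt{c_{22}c_{33}}}, \quad\text{etc.}$$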


The significance of a partial correlation coefficient is tested by the
calculation of $t$ from

$$t = \frac{r\sqrt{n - p - 1}}{\sqrt{1 - r^2}} \tag{22}$$

where $r$ is the coefficient being tested, $n$ is the number in the sample, and
$p$ is the number of independent variates.
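The route from the $r$ matrix to the partial coefficients can be sketched as follows. This is a minimal illustration, not the abbreviated Doolittle layout of the text: the inverse is obtained by ordinary Gauss-Jordan elimination, the function names are hypothetical, and the correlations are those of Table 8-4. The sample size behind Table 8-4 is not stated in this section, so the `t_statistic` helper is defined but not applied to those data.

```python
import math


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def standard_regression(c, i, j):
    """Standard partial regression of variable i on variable j: -c_ij / c_ii."""
    return -c[i][j] / c[i][i]


def partial_correlation(c, i, j):
    """Partial correlation of i and j with all other variables held constant."""
    return -c[i][j] / math.sqrt(c[i][i] * c[j][j])


def t_statistic(r, n, p):
    """t for a partial correlation r from n observations, p independent variates."""
    return r * math.sqrt(n - p - 1) / math.sqrt(1.0 - r * r)


# Upper triangle of Table 8-4 (variables 1 to 6); the unit diagonal is implied.
upper = [
    [-0.4589, -0.5612, -0.3947, -0.3123, 0.6412],
    [0.3114, 0.0429, 0.2861, -0.3190],
    [-0.0655, 0.1467, -0.4462],
    [0.1882, -0.3511],
    [-0.3092],
]
size = len(upper) + 1
r_matrix = [[1.0] * size for _ in range(size)]
for i, row in enumerate(upper):
    for offset, value in enumerate(row):
        j = i + 1 + offset
        r_matrix[i][j] = r_matrix[j][i] = value

c_matrix = invert(r_matrix)
print("c11 =", round(c_matrix[0][0], 4))
for j in range(1, size):
    print("r1%d.rest =" % (j + 1), round(partial_correlation(c_matrix, 0, j), 3))
```

The printed values can be compared with the multipliers and partial correlation coefficients worked out by the Doolittle method in Table 8-5; small differences in the last figure are to be expected from rounding in the hand calculation.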

In Example 8-4 the abbreviated Doolittle method is used to obtain the 
matrix of multipliers for a problem with 6 variables. 

12. Example 8-4. Calculation of partial and multiple correlation 
coefficients. The total correlation coefficients in Table 8-4 were obtained
in a study [5] of the effect of the physical characteristics of wheat on the
yield and quality of flour.


TABLE 8-4

SIMPLE CORRELATION COEFFICIENTS FOR THE RELATIONS BETWEEN 6 VARIABLES

          2         3         4         5         6
  1   -0.4589   -0.5612   -0.3947   -0.3123    0.6412
  2              0.3114    0.0429    0.2861   -0.3190
  3                       -0.0655    0.1467   -0.4462
  4                                  0.1882   -0.3511
  5                                           -0.3092

  1 = yield of flour               4 = percentage immaturity
  2 = percentage of bran frost     5 = percentage of green kernels
  3 = percentage of heavy frost    6 = weight per bushel


Tables 8-5 and 8-6 together with the line-by-line instructions give the 
complete details of the calculations. Further abbreviations in the method 
are given in the Appendix. 

13. Interpretation and significance of multiple correlation coefficients. 
We have noted that the multiple regression equation is 


$$Y_e = b_0 + b_1x_1 + \cdots + b_px_p$$

where the dependent variable is $Y$ and the independent variables are
$X_1, X_2, \ldots, X_p$. The greater the prediction power of this equation, the
closer the agreement between the actual and predicted values of $y$. Hence
the prediction power of the equation should be measured by the simple
correlation $r_{yy_e}$. This actually is the multiple correlation coefficient,
$r_{y\cdot12\cdots p}$, and obviously its minimum value is zero and its maximum value
is $+1$. It does not have a range from $-1$ to $+1$ as in the case of a
simple correlation between two variables.

In the analysis of the sums of squares of $Y$ we have

$$\Sigma y^2 = \Sigma(Y - Y_e)^2 + \Sigma(Y_e - \bar{Y})^2$$

or

$$\Sigma y^2 = \Sigma y_d^2 + \Sigma y_e^2$$

as in the case of simple correlation, but the corresponding degrees of
freedom in this case are

$$n - 1 = (n - p - 1) + p$$

TABLE 8-5

ABBREVIATED DOOLITTLE SOLUTION FOR STANDARD REGRESSION AND PARTIAL CORRELATION COEFFICIENTS

  1   1.0000  -0.4589  -0.5612  -0.3947  -0.3123   0.6412 |      1.0        0        0        0        0        0
  2            1.0000   0.3114   0.0429   0.2861  -0.3190 |        0      1.0        0        0        0        0
  3                     1.0000  -0.0655   0.1467  -0.4462 |        0        0      1.0        0        0        0
  4                              1.0000   0.1882  -0.3511 |        0        0        0      1.0        0        0
  5                                       1.0000  -0.3092 |        0        0        0        0      1.0        0
  6                                                1.0000 |        0        0        0        0        0      1.0
  7      1.0  -0.4589  -0.5612  -0.3947  -0.3123   0.6412 |      1.0        0
  8      1.0  -0.4589  -0.5612  -0.3947  -0.3123   0.6412 |      1.0        0
  9            0.7894   0.0539  -0.1382   0.1428  -0.0248 |   0.4589      1.0        0
 10               1.0   0.0683  -0.1751   0.1809  -0.0314 |   0.5813   1.2668        0
 11                     0.6814  -0.2776  -0.0383  -0.0847 |   0.5299  -0.0682      1.0        0
 12                        1.0  -0.4074  -0.0562  -0.1243 |   0.7777  -0.1001   1.4676        0
 13                              0.7069   0.0743  -0.1369 |   0.6909   0.1473   0.4074      1.0        0
 14                                 1.0   0.1051  -0.1937 |   0.9774   0.2084   0.5763   1.4146        0
 15                                       0.8667  -0.0948 |   0.1865  -0.2002   0.0134  -0.1051      1.0        0
 16                                          1.0  -0.1094 |   0.2152  -0.2310   0.0155  -0.1213   1.1538        0
 17                                                0.5407 |  -0.4067   0.0296   0.2047   0.1822   0.1094      1.0
 18                                                   1.0 |  -0.7522   0.0547   0.3786   0.3370   0.2023   1.8494
 19   0.4067  -0.0296  -0.2047  -0.1822  -0.1094          |   2.7002   0.6069   1.0248   0.8177   0.1329  -0.7522
 20  -0.1130   0.1913  -0.0484   0.0718  -0.1720          |            1.3522  -0.0071   0.2426  -0.2250   0.0547
 21  -0.5492  -0.1630  -0.4324   0.0567  -0.2264          |                     1.7801   0.6437   0.0569   0.3786
 22  -0.5757   0.0040  -0.3616  -0.0320  -0.2127          |                              1.4888  -0.0844   0.3370
 23  -0.4488   0.0052  -0.1794   0.1664  -0.0404          |                                       1.1759   0.2023
 24  -0.2248  -0.3795  -0.3028  -0.0492   0.2786          |                                                1.8494
 25                                                       |            -0.318   -0.467   -0.408   -0.074    0.337
 26                                                       |                      0.005   -0.171    0.178   -0.034
 27                                                       |                              -0.395   -0.039   -0.209
 28                                                       |                                        0.064   -0.203
 29  r1.23456 = 0.678                                     |                                                -0.137


TABLE 8-6

CALCULATION OF PARTIAL AND MULTIPLE CORRELATION COEFFICIENTS*

 1  a11       a12       a13       a14       a15       a16       | 1         0         0         0         0         0
 2            a22       a23       a24       a25       a26       | 0         1         0         0         0         0
 3                      a33       a34       a35       a36       | 0         0         1         0         0         0
 4                                a44       a45       a46       | 0         0         0         1         0         0
 5                                          a55       a56       | 0         0         0         0         1         0
 6                                                    a66       | 0         0         0         0         0         1
 7  a11       a12       a13       a14       a15       a16       | d11       0
 8  1         b12       b13       b14       b15       b16       | e11       0
 9            a22.1     a23.1     a24.1     a25.1     a26.1     | d21.1     d22.1     0
10            1         b23.1     b24.1     b25.1     b26.1     | e21.1     e22.1     0
11                      a33.12    a34.12    a35.12    a36.12    | d31.12    d32.12    d33.12    0
12                      1         b34.12    b35.12    b36.12    | e31.12    e32.12    e33.12    0
13                                a44.123   a45.123   a46.123   | d41.123   d42.123   d43.123   d44.123   0
14                                1         b45.123   b46.123   | e41.123   e42.123   e43.123   e44.123   0
15                                          a55.1234  a56.1234  | d51.1234  d52.1234  d53.1234  d54.1234  d55.1234  0
16                                          1         b56.1234  | e51.1234  e52.1234  e53.1234  e54.1234  e55.1234  0
17                                                    a66.12345 | d61.12345 d62.12345 d63.12345 d64.12345 d65.12345 d66.12345
18                                                    1         | e61.12345 e62.12345 e63.12345 e64.12345 e65.12345 e66.12345
19  b'61      b'62      b'63      b'64      b'65                | c11       c12       c13       c14       c15       c16
20  b'51      b'52      b'53      b'54      b'56                |           c22       c23       c24       c25       c26
21  b'41      b'42      b'43      b'45      b'46                |                     c33       c34       c35       c36
22  b'31      b'32      b'34      b'35      b'36                |                               c44       c45       c46
23  b'21      b'23      b'24      b'25      b'26                |                                         c55       c56
24  b'12      b'13      b'14      b'15      b'16                |                                                   c66
25                                                              |           r12       r13       r14       r15       r16
26  r²(1.23456) = a12 b'12 + a13 b'13 + a14 b'14                |                     r23       r24       r25       r26
27                + a15 b'15 + a16 b'16                         |                               r34       r35       r36
28                                                              |                                         r45       r46
29                                                              |                                                   r56
30  r1.23456

* Subscripts have been abbreviated, thus $b'_{25} = b'_{25\cdot1346}$.

Instructions for Calculation of the Inverse Matrix Yielding Standard Partial
Regression Coefficients and Partial Correlation Coefficients
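Line
1 to 6     Enter known values for normal equations.
7          Repeat line 1.
8          Divide each item through by $a_{11}$.
9          $a_{22\cdot1} = a_{22} - a_{12}b_{12}$, $a_{23\cdot1} = a_{23} - a_{12}b_{13}$, $a_{24\cdot1} = a_{24} - a_{12}b_{14}$, $\cdots$,
           $d_{22\cdot1} = 1 - 0 \times b_{12}$
10         Divide line 9 by $a_{22\cdot1}$.
11         $a_{33\cdot12} = a_{33} - a_{13}b_{13} - a_{23\cdot1}b_{23\cdot1}$, $a_{34\cdot12} = a_{34} - a_{13}b_{14} - a_{23\cdot1}b_{24\cdot1}$, $\cdots$,
           $d_{33\cdot12} = 1 - 0 \times b_{23\cdot1}$
12         Divide line 11 by $a_{33\cdot12}$.
13         $a_{44\cdot123} = a_{44} - a_{14}b_{14} - a_{24\cdot1}b_{24\cdot1} - a_{34\cdot12}b_{34\cdot12}$,
           $a_{45\cdot123} = a_{45} - a_{14}b_{15} - a_{24\cdot1}b_{25\cdot1} - a_{34\cdot12}b_{35\cdot12}$, $\cdots$,
           $d_{44\cdot123} = 1 - 0 \times b_{34\cdot12}$
14 to 18   Complete calculations following systematic procedure outlined above.
19         $c_{11} = d_{11}e_{11} + d_{21\cdot1}e_{21\cdot1} + d_{31\cdot12}e_{31\cdot12} + \cdots + d_{61\cdot12345}e_{61\cdot12345}$
           $c_{12} = d_{21\cdot1}e_{22\cdot1} + d_{31\cdot12}e_{32\cdot12} + \cdots + d_{61\cdot12345}e_{62\cdot12345}$
           $c_{13} = d_{31\cdot12}e_{33\cdot12} + \cdots + d_{61\cdot12345}e_{63\cdot12345}$
           etc.
           (Omit calculation of regression coefficients at this stage.)
20         $c_{22} = d_{22\cdot1}e_{22\cdot1} + d_{32\cdot12}e_{32\cdot12} + d_{42\cdot123}e_{42\cdot123} + \cdots + d_{62\cdot12345}e_{62\cdot12345}$
           $c_{23} = d_{32\cdot12}e_{33\cdot12} + d_{42\cdot123}e_{43\cdot123} + \cdots + d_{62\cdot12345}e_{63\cdot12345}$
           etc.
21         $c_{33} = d_{33\cdot12}e_{33\cdot12} + d_{43\cdot123}e_{43\cdot123} + \cdots + d_{63\cdot12345}e_{63\cdot12345}$
           etc.
22 to 24   Complete calculation of multipliers.
25         $r_{12\cdot3456} = \dfrac{-c_{12}}{\sqrt{c_{11}c_{22}}}$, $r_{13\cdot2456} = \dfrac{-c_{13}}{\sqrt{c_{11}c_{33}}}$, etc.
26         $r_{23\cdot1456} = \dfrac{-c_{23}}{\sqrt{c_{22}c_{33}}}$, $r_{24\cdot1356} = \dfrac{-c_{24}}{\sqrt{c_{22}c_{44}}}$, etc.
27 to 30   Complete calculation of partial correlation coefficients.

Return to line 19: $b'_{61} = -\dfrac{c_{16}}{c_{66}}$, $b'_{62} = -\dfrac{c_{26}}{c_{66}}$, $\cdots$, $b'_{16} = -\dfrac{c_{16}}{c_{11}}$.
For these calculations, first obtain reciprocals of $c_{11}, c_{22}, \cdots, c_{66}$ and multiply these
systematically by numerators to obtain the required quotients. A complete check is
obtained by repeating the calculation of the partial coefficients from
$r_{12\cdot3\cdots6} = \sqrt{b'_{12}\,b'_{21}}$, etc.

where p is the number of independent variables. A test of significance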
of the multiple regression equation or of the multiple correlation
coefficient is given by calculating

$$F = \frac{\Sigma y_e^2}{\Sigma y_d^2} \cdot \frac{n - p - 1}{p} \tag{23}$$

and the table of $F$ is entered under $p$ and $n - p - 1$ degrees of freedom.

When the correlations only are known, $F$ can be calculated from

$$F = \frac{r^2}{1 - r^2} \cdot \frac{n - p - 1}{p} \tag{24}$$

where $r$ is the multiple correlation coefficient. This arises from the fact that

$$r^2 = \frac{\Sigma y_e^2}{\Sigma y^2} \tag{25}$$

and

$$1 - r^2 = \frac{\Sigma y_d^2}{\Sigma y^2} \tag{26}$$

Putting (26) into the form

$$r^2 = 1 - \frac{\Sigma y_d^2}{\Sigma y^2} \tag{27}$$

is also of value in an interpretation of the multiple correlation coefficient.
As the deviations from regression approach zero, the coefficient approaches
1.
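The two forms of the F test, from the sums of squares and from the correlation alone, can be sketched as follows. The function names are hypothetical, and the figures used in the call ($r = 0.732$, $n = 20$, $p = 2$) are borrowed from the loaf-volume example of Section 14 purely as an illustration.

```python
def f_from_sums_of_squares(ss_regression, ss_deviations, n, p):
    """F from the regression and deviation sums of squares, p and n-p-1 DF."""
    return (ss_regression / p) / (ss_deviations / (n - p - 1))


def f_from_multiple_r(r, n, p):
    """F from the multiple correlation coefficient alone, p and n-p-1 DF."""
    r2 = r * r
    return (r2 / (1.0 - r2)) * (n - p - 1) / p


# Illustrative values: multiple correlation 0.732 from 20 observations, p = 2.
f_value = f_from_multiple_r(0.732, 20, 2)
print(round(f_value, 2), "with", 2, "and", 20 - 2 - 1, "degrees of freedom")
```

Because $r^2$ and $1 - r^2$ are the regression and deviation sums of squares expressed as fractions of $\Sigma y^2$, both functions return the same value when fed consistent inputs.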

14. Special applications. The analysis of variance provides a test of 
the significance of the additional information obtained in calculating 
multiple correlation coefficients. This principle was applied by Geddes 
and Goulden [4] in a practical problem in cereal chemistry. Correlations 
were first determined between loaf volume of wheat flour and the per- 
centage of protein. In later studies the protein was separated into two 
portions, peptized and non-peptized, and with these two portions as 
variables the multiple correlation for their combined effect on loaf volume 
was calculated. If the proportions of the two kinds of protein have an 
important effect on loaf volume, the multiple correlation should be 
significantly higher than the simple correlation for total protein and loaf 
volume. A method of comparing the two correlations would determine, 
therefore, the practical significance, for purposes of predicting flour 
quality, of knowing the amounts of peptized and non-peptized protein in 
addition to the total protein. 

If we let $X_1$ represent loaf volume, $X_2$ the peptized protein, $X_3$ the
non-peptized protein, and $X_4$ the total protein, the corresponding simple and
multiple correlation coefficients are $r_{14}$ and $r_{1\cdot23}$. The total protein is, of
course, $(X_2 + X_3)$, the sum of the two fractions.

Assuming these correlations to be determined from 20 pairs of values,
the sums of squares representing deviations from the regression function
are proportional to $(1 - r_{14}^2)$ and $(1 - r_{1\cdot23}^2)$, respectively, and the
corresponding degrees of freedom are 18 and 17. The effect of bringing
in more variables to estimate $X_1$ as in multiple regression is to decrease
the sum of squares due to deviations from the regression function, but
for each additional variable introduced 1 degree of freedom is lost, and
unless the reduction of the sum of squares is more than proportional to
the loss in degrees of freedom there is no gain in precision. An analysis
may therefore be set up as follows:


$$
\begin{array}{lccc}
 & \text{SS} & \text{DF} & \text{MS} \\
\text{Deviations, regression of } X_1 \text{ on } X_4 & 1 - r_{14}^2 & 18 & \\
\text{Deviations, regression of } X_1 \text{ on } X_2 \text{ and } X_3 & 1 - r_{1\cdot23}^2 & 17 & (1) \\
\text{Additional degree of freedom} & (1 - r_{14}^2) - (1 - r_{1\cdot23}^2) & 1 & (2)
\end{array}
$$

Applying the F test to the mean squares (1) and (2), taking (1) as the
error, we can determine the significance of the gain in information due to
the addition of another variable.

In one actual experiment for a series of 20 flours from No. 2 Northern
wheat, $r_{14} = 0.511$ and $r_{1\cdot23} = 0.732$. The analysis gives

                        SS         DF      MS          F      1% Point
  1 - r²(14)         0.738 879     18
  1 - r²(1.23)       0.464 176     17    0.027 30
  Difference         0.274 703      1    0.2747      10.06      8.40
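The comparison can be sketched in code. The function name and its default arguments (one predictor for the simple correlation, two for the multiple correlation) are assumptions made for this illustration; the input figures are those of the flour experiment quoted in the text.

```python
def gain_in_information(r_simple, r_multiple, n, p_simple=1, p_multiple=2):
    """F test of the reduction in deviations gained by the extra variable(s).

    Sums of squares are expressed as fractions of the total, so they are
    1 - r^2 for each regression.
    """
    ss_simple = 1.0 - r_simple ** 2
    ss_multiple = 1.0 - r_multiple ** 2
    df_simple = n - p_simple - 1
    df_multiple = n - p_multiple - 1
    df_gain = df_simple - df_multiple
    ms_error = ss_multiple / df_multiple           # mean square (1)
    ms_gain = (ss_simple - ss_multiple) / df_gain  # mean square (2)
    return ms_gain / ms_error, df_gain, df_multiple


f_value, df_gain, df_error = gain_in_information(0.511, 0.732, 20)
print(round(f_value, 2), df_gain, df_error)
```

The same function can be applied to each line of Exercise 7 by changing the three arguments.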


15. Exercises. 


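1. R. A. Fisher [3] gives the following data for the effect of longitude ($X_1$), latitude
($X_2$), and altitude ($X_3$) on yield $Y$.

$$
\begin{array}{lll}
\Sigma y^2 = 1786.6 & & \\
\Sigma(x_1y) = 1137.4 & \Sigma(x_1^2) = 1934.1 & \Sigma(x_1x_2) = -772.2 \\
\Sigma(x_2y) = 592.9 & \Sigma(x_2^2) = 2889.5 & \Sigma(x_2x_3) = 924.1 \\
\Sigma(x_3y) = 891.8 & \Sigma(x_3^2) = 1750.8 & \Sigma(x_1x_3) = 119.6
\end{array}
$$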
bama азо Dass чы =ош) bua = 03079 


Calculate the partial regression coefficients for these data (a) by the direct method as
in Example 8-1 and (b) by obtaining a matrix of multipliers as in Example 8-2.
2. Test the significance of the partial regression coefficients obtained in Exercise 1,
applying formulas (10). Devise a method of determining the fiducial limits of bg. 

3. Test the significance of the difference between b, and b in Example 8-2. 

4. From the data of Exercise 1 obtain the correlation coefficients $r_{12}$, $r_{13}$, $r_{14}$, $r_{23}$, $r_{24}$, $r_{34}$,
substituting $X_1$ for $Y$, $X_2$ for $X_1$, $X_3$ for $X_2$, and $X_4$ for $X_3$. Then solve for the partial
correlation coefficients as in Example 8-4.

5. Calculate from Exercise 4 the multiple correlation coefficient $r_{1\cdot234}$ and test its
significance.

6. For n’ = 40, determine the multiple correlation г. that is just significant. 

7. Determine the significance of the gain in information through the calculation of 
multiple correlations in the examples given below. For each comparison, state your 
conclusion in words. 


$$
\begin{array}{lll}
n = 40 & r_{14} = 0.7643 & r_{1\cdot23} = 0.8031 \\
n = 62 & r_{14} = 0.8744 & r_{1\cdot23} = 0.9664 \\
n = 20 & r_{14} = 0.7621 & r_{1\cdot23} = 0.7635 \\
n = 20 & r_{14} = 0.7316 & r_{1\cdot23} = 0.7329
\end{array}
$$

CHAPTER 9


The Analysis of Covariance 


1. Principles of covariance analysis. In the analysis of variance for 
a single variable, the variation can be partitioned into different com- 
ponents. Having a two-way classification, we can sort out the variance 
components into rows, columns, and residual. In a randomized block 
experiment we can sort out the variance components attributable to 
blocks, treatments, and error. The same is true for the correlated 
variability or covariation of two variables, and the mechanism for sorting 
out the covariance effects is known as the analysis of covariance.

In order to think in terms of actual values, we can assume that the 
variables are yields of grain and straw from cereal plots. Assuming a 
test of n treatments conducted in r randomized blocks, we would have 
the following sets of values, where Y represents the yield and X the straw. 


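$$
\begin{array}{c|cccc|c}
 & \multicolumn{4}{c|}{\text{Treatment}} & \\
\text{Block} & 1 & 2 & \cdots & n & \text{Mean} \\
\hline
1 & X_{11}\ Y_{11} & X_{12}\ Y_{12} & \cdots & X_{1n}\ Y_{1n} & \bar{X}_{1\cdot}\ \bar{Y}_{1\cdot} \\
2 & X_{21}\ Y_{21} & X_{22}\ Y_{22} & \cdots & X_{2n}\ Y_{2n} & \bar{X}_{2\cdot}\ \bar{Y}_{2\cdot} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
r & X_{r1}\ Y_{r1} & X_{r2}\ Y_{r2} & \cdots & X_{rn}\ Y_{rn} & \bar{X}_{r\cdot}\ \bar{Y}_{r\cdot} \\
\hline
\text{Mean} & \bar{X}_{\cdot1}\ \bar{Y}_{\cdot1} & \bar{X}_{\cdot2}\ \bar{Y}_{\cdot2} & \cdots & \bar{X}_{\cdot n}\ \bar{Y}_{\cdot n} & \bar{X}\ \bar{Y}\ \text{(general mean)}
\end{array}
$$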


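Note that there are $n$ treatment means of $X$ and $Y$, and $r$ block means of
$X$ and $Y$, and each of these sets of paired values can determine a coefficient
for the regression of $Y$ on $X$. Also there is a set of $rn$ values representing
individual deviations of $X$ and $Y$ from the general mean, less the effects
due to block and treatment means.
Specifically, these are

$$
\begin{array}{ll}
(X_{11} - \bar{X}_{1\cdot} - \bar{X}_{\cdot1} + \bar{X}) & (Y_{11} - \bar{Y}_{1\cdot} - \bar{Y}_{\cdot1} + \bar{Y}) \\
(X_{12} - \bar{X}_{1\cdot} - \bar{X}_{\cdot2} + \bar{X}) & (Y_{12} - \bar{Y}_{1\cdot} - \bar{Y}_{\cdot2} + \bar{Y}) \\
\qquad\vdots & \qquad\vdots \\
(X_{rn} - \bar{X}_{r\cdot} - \bar{X}_{\cdot n} + \bar{X}) & (Y_{rn} - \bar{Y}_{r\cdot} - \bar{Y}_{\cdot n} + \bar{Y})
\end{array}
$$

This set will also yield a regression coefficient of $Y$ on $X$.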

In connection with the problem of sorting out regression effects the
first question we ask is: What are the expected values of the regression
coefficients, which we shall represent by $b_b$ for block means, $b_t$ for
treatment means, and $b_e$ for the residual or error term? Let us suppose that
we have a large population to be sampled in which $Y$ is normally
distributed for each level of $X$ and in which the regression coefficient of $Y$
on $X$ is $\beta_{yx}$. Then, if the pairs of values of $X$ and $Y$ are selected at
random from this population and arranged in the two-way classification
for blocks and treatments, it can be shown that the expected value of
$b_b$, $b_t$, and $b_e$ is $\beta_{yx}$. That is, if in an actual experiment $b_t$ is considerably
larger than $b_e$, which in any event has the expected value $\beta_{yx}$, we are
led to the conclusion that the data are heterogeneous for regression
effects, or, in other words, that the regression effects for the treatments
did not arise merely from random sampling of the regression in the
population represented by the error term.

The sorting out of regression and correlation effects is, therefore, an 
obvious application of the covariance analysis. A regression coefficient 
measuring a total effect is not capable of a definite interpretation, nor can 
it be submitted to valid tests of significance if heterogeneity is present. 
Suppose that we have a total regression of yield of barley on quantity of 
malt extract for 20 plots located at each of 6 stations. The sums of 
squares and degrees of freedom can be sorted out as follows:

                                  SS     DF
  Regression                              1
  Deviations from regression            118
  Total                                 119

and the regression may appear to be highly significant, but, if it is sorted
out into its components, between and within stations, we have two
analyses:

  Between Stations              DF      Within Stations               DF
  Regression                     1      Regression                     1
  Deviations from regression     4      Deviations from regression   113
  Total                          5      Total                        114

and, if the greater part of the total regression is caused by the station 
means, it is quite possible that there will be no significance in either case, 
as the between-station regression has to be tested against an error based 
on only 4 degrees of freedom. 

A total covariance of the type mentioned above is sometimes referred 
to as containing a spurious effect. This is taken care of in the covariance 
analysis, and the so-called spurious effect is not only removed but 
evaluated as a distinct component of the total. 

2. Applications of the covariance method to the control of error. One 
of the most important applications of the analysis of covariance is in the 
control of errors that arise at random throughout the experiment and 
cannot be taken care of by replication. For example, the number of 
plants per plot for such crops as mangels and sugar beets may occur at 
random throughout the experiment and, so far as they affect the yields 
of single plots, add to the experimental error. Correction of the yields 
on the basis that yield is directly proportional to the number of plants is 
a frequent practice, but it is not difficult to demonstrate that yield is 
rarely if ever proportional to the number of plants per plot and that such 
an adjustment is likely to exaggerate the yields of plots in which plants 
are missing. Correction on the basis of the exact relation between yield 
and number of plants as indicated by the data is, however, perfectly 
justifiable, and the method of making such a correction is a natural 
development of the covariance technique. Numerous applications of the 
same method will undoubtedly occur to workers in other fields. 

In order to demonstrate the control of error by the covariance method
we shall represent a covariance analysis algebraically as follows, in which
the experiment is presumed to be a randomized block field plot test.

$$
\begin{array}{lcccccccc}
 & \text{DF} & \Sigma(x^2) & \Sigma(xy) & \Sigma(y^2) & b_{yx} & b_{yx}\Sigma(xy) & \Sigma(y_d^2) & \text{DF for } \Sigma(y_d^2) \\
\text{Blocks} & p & A_0 & B_0 & C_0 & & & & \\
\text{Treatments} & q & A_1 & B_1 & C_1 & b_1 = B_1/A_1 & b_1B_1 & C_1 - b_1B_1 & q - 1 \\
\text{Error} & n & A_2 & B_2 & C_2 & b_2 = B_2/A_2 & b_2B_2 & C_2 - b_2B_2 & n - 1 \\
\text{T} + \text{E} & n + q & A_3 & B_3 & C_3 & b_3 = B_3/A_3 & b_3B_3 & C_3 - b_3B_3 & n + q - 1
\end{array}
$$

In the column headings, $\Sigma(y_d^2)$ indicates a sum of squares for $Y$
adjusted by the regression coefficient in the same line.

The calculations are complete in each line of the table. The regression
coefficient is $B/A$, and the adjustment in the sum of squares for $Y$ is $bB$
or $B^2/A$. In the last line we are considering only treatments and error so
that $A_3 = A_1 + A_2$, $B_3 = B_1 + B_2$, and $C_3 = C_1 + C_2$.

The second step in the procedure is indicated as follows.

$$
\begin{array}{lccc}
 & \text{DF} & \text{SS} & \text{MS} \\
\text{T} + \text{E} & n + q - 1 & C_3 - b_3B_3 & \\
\text{E} & n - 1 & C_2 - b_2B_2 & v_1 \\
\text{T (difference)} & q & C_1 + b_2B_2 - b_3B_3 & v_2 \\
\text{T} & q - 1 & C_1 - b_1B_1 & v_3 \\
(b_1 - b_2) \text{ (difference)} & 1 & b_1B_1 + b_2B_2 - b_3B_3 & v_4
\end{array}
$$

The first sum of squares for treatments is obviously represented by $q$
degrees of freedom, the expression for degrees of freedom corresponding
to $C_1 + b_2B_2 - b_3B_3$ being $q + 1 - 1$. The second treatment sum of
squares is written down from the first table and is represented by $q - 1$
degrees of freedom, as it has been adjusted by the treatment coefficient.
On subtracting the second treatment sum of squares from the first, we
have a sum of squares given by $b_1B_1 + b_2B_2 - b_3B_3$, and it is not difficult
to prove the following equality:

$$b_1B_1 + b_2B_2 - b_3B_3 = b_1^2A_1 + b_2^2A_2 - b_3^2A_3 = \frac{A_1A_2}{A_1 + A_2}(b_1 - b_2)^2$$
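The equality can be checked in a few lines. What follows is a sketch of one possible derivation, using only the definition $b_i = B_i/A_i$ and the additivity $A_3 = A_1 + A_2$, $B_3 = B_1 + B_2$ stated for the last line of the first table.

$$
\begin{aligned}
b_iB_i &= b_i^2A_i, \qquad b_3 = \frac{B_1 + B_2}{A_1 + A_2} = \frac{b_1A_1 + b_2A_2}{A_1 + A_2}, \\
b_1B_1 + b_2B_2 - b_3B_3 &= b_1^2A_1 + b_2^2A_2 - \frac{(b_1A_1 + b_2A_2)^2}{A_1 + A_2} \\
&= \frac{(b_1^2A_1 + b_2^2A_2)(A_1 + A_2) - (b_1A_1 + b_2A_2)^2}{A_1 + A_2} \\
&= \frac{A_1A_2\,(b_1^2 - 2b_1b_2 + b_2^2)}{A_1 + A_2} = \frac{A_1A_2}{A_1 + A_2}(b_1 - b_2)^2 .
\end{aligned}
$$

The terms in $A_1^2$ and $A_2^2$ cancel in the third line, leaving only the cross terms in $A_1A_2$.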


It follows that when $b_1 = b_2$ this sum of squares is zero, and that a test
of significance of the corresponding variance $v_4$ is a test of the significance
of the difference between the error and treatment regression coefficients.
The test of significance of the treatment differences involves a
comparison of the variances $v_2$ and $v_1$. The fact that $v_2$ may contain a
significant effect due to $b_1 - b_2$ does not vitiate the meaning of the test,
as such an effect is obviously due to some factor characteristic of the
treatments. In respect to yield and number of plants per plot, the
treatment regression coefficient $b_1$ may be higher than $b_2$, and this will contribute
to the significance of $v_2$; but $b_2$ represents the regression of yield on
number of plants within treatments and may be taken as a true measure
of the effect of number of plants on yield. If the treatment regression
coefficient is higher than $b_2$, this probably reflects an additional genetic
relationship and one that should contribute to the significance of the
differences between the treatments. A further test may be applied,
however, to $v_3$, and by a comparison of the significance of $v_3$ and $v_4$ a
complete picture of the variety effects is obtained. The value of such an
analysis, if, for example, number of roots has a significant effect on yield,
is that the error variance will be reduced with a consequent increase in
the significance of the treatment differences, if such differences exist. If
the analysis of the unadjusted yields shows significant differences when
the adjusted yields do not, this simply means that the original differences
were due to number of roots and not to the effect of the treatments as
measured by average yield per root.

R.A. Fisher [5] has pointed out that an appropriate scale for measuring 
the effectiveness of methods of reducing the error is the inverse of the 
variance. This is sometimes called the invariance, or the quantity of
information, and is represented by $1/v$. In measuring the reduction of
error by means of the covariance analysis, this scale is particularly 
valuable. If the original error variance is twice as large as the final error 
variance obtained by adjusting the sum of squares for the associated 
variable, in the original form about twice as many replicates would be 
necessary to give the same accuracy as for the adjusted results. One 
should not reason from this that the significance of the differences between 
the treatments will always be increased accordingly, as it must be remem- 
bered that at the same time differences between the treatments due to the 
associated variable are also being removed. 

The test of significance having been applied as outlined, the next step 
is to make corrections to the variety means. Since the regression co- 
efficient in the error line may be considered as representing the actual 
effect of number of roots on yield, this regression coefficient should enter 
into the formula for making corrections. The corrected means should 
then be the best possible estimates of what the means would have been if 
they had not been affected by variations in number of roots. The
regression equation will be of the form

$$\hat{Y}_1 = \bar{Y}_1 - b_{yx}(\bar{X}_1 - \bar{X}) \tag{1}$$

where $\bar{X}_1$ is the mean of $X$ for treatment 1, $\bar{Y}_1$ is the mean of $Y$ for the
same treatment, $\bar{X}$ is the general mean of $X$, $b_{yx}$ is the regression of $Y$ on
$X$ in the error line, and $\hat{Y}_1$ is the estimated mean of the variety.

Formulas for the standard error of differences between corrected means
have been given by Wishart [12]. To compare two corrected means such
as $\hat{Y}_1$ and $\hat{Y}_2$ the standard error of the difference between two means is

$$s_{\bar{d}} = \sqrt{s^2\left[\frac{2}{r} + \frac{(\bar{X}_1 - \bar{X}_2)^2}{A_2}\right]}$$

where $s^2$ is the mean square in the error line of the analysis of covariance
table (for example, in Table 9-2 it will be 19.56/8 = 2.445), $A_2$ is the sum
of squares for $X$ in the same line, $r$ is the number of replications, and
$(\bar{X}_1 - \bar{X}_2)$ is the difference between the two means in the two expressions
for calculating $\hat{Y}_1$ and $\hat{Y}_2$ from equation (1). Thus

$$\hat{Y}_1 = \bar{Y}_1 - b_{yx}(\bar{X}_1 - \bar{X}) \quad\text{and}\quad \hat{Y}_2 = \bar{Y}_2 - b_{yx}(\bar{X}_2 - \bar{X})$$

In comparing two means corrected for two variables $X_1$ and $X_2$, we
calculate the standard error of a mean difference as follows.

$$s_{\bar{d}} = \sqrt{s^2\left[\frac{2}{r} + \frac{Bu^2 - 2Puv + Av^2}{AB - P^2}\right]}$$

where $A$ and $B$ are the sums of squares in the error line for $X_1$ and $X_2$,
$P$ is the sum of products for $X_1$ and $X_2$ in the error line, $u = (\bar{X}_{11} - \bar{X}_{12})$,
difference between $X_1$ means, and $v = (\bar{X}_{21} - \bar{X}_{22})$, difference between
$X_2$ means.

Finney [4] has suggested a simple formula for the variance of the
difference between two adjusted means. The exact formula given above
is different for each comparison, and Finney's suggestion is to compromise
by incorporating the average value of $(\bar{X}_1 - \bar{X}_2)^2$, which can be equated to

$$\frac{2A_1}{r(t - 1)}$$

where $A_1$ is the treatment sum of squares for $X$, $r$ is the number of
replications, and $t$ is the number of treatments. The general formula for the
variance of a difference simplifies to

$$\frac{2s^2}{r}\left[1 + \frac{A_1}{(t - 1)A_2}\right] \tag{5}$$

where $A_2$ is the sum of squares for $X$ in the error line.
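The adjustment of treatment means and the two variance formulas can be sketched together. The function names are hypothetical; the numerical inputs are the ration totals of Table 9-1 and the treatment and error lines of Table 9-2 from Example 9-1, used here only to give the formulas something to work on.

```python
import math


def adjusted_mean(y_mean, x_mean, x_general_mean, b_error):
    """Treatment mean of Y corrected to the general mean of X, equation (1)."""
    return y_mean - b_error * (x_mean - x_general_mean)


def variance_of_difference(s2, r, x_mean_1, x_mean_2, a_error):
    """Exact variance of the difference between two adjusted means."""
    return s2 * (2.0 / r + (x_mean_1 - x_mean_2) ** 2 / a_error)


def average_variance_of_difference(s2, r, t, a_treatments, a_error):
    """Finney's average variance of a difference, formula (5)."""
    return (2.0 * s2 / r) * (1.0 + a_treatments / ((t - 1) * a_error))


# Ration totals from Table 9-1 and error-line values from Table 9-2.
x_totals = [97, 85, 103, 103]
y_totals = [79, 66, 71, 77]
r, t = 4, 4                    # replications and treatments
a_treatments, a_error = 54.00, 73.50
b_error = 49.25 / 73.50        # regression coefficient in the error line
s2 = 19.56 / 8                 # error mean square after adjustment

x_means = [total / r for total in x_totals]
y_means = [total / r for total in y_totals]
x_general = sum(x_totals) / (r * t)

adjusted = [adjusted_mean(y, x, x_general, b_error)
            for x, y in zip(x_means, y_means)]
exact_12 = variance_of_difference(s2, r, x_means[0], x_means[1], a_error)
average = average_variance_of_difference(s2, r, t, a_treatments, a_error)

print([round(value, 2) for value in adjusted])
print(round(exact_12, 3), round(average, 3), round(math.sqrt(average), 2))
```

The exact variance changes with the pair of treatments compared, whereas the average form gives a single value that can be used for every comparison in the experiment.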
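3. Testing the heterogeneity of a series of regression coefficients. The
analysis of covariance provides a unique method for testing the significance
of the differences between two or more regression coefficients. Following
the same symbolism as in the previous section, the procedure is as given
in the tables below.

The last sum of squares in the test of significance may be shown to be

$$\frac{\Sigma A_iA_j(b_i - b_j)^2}{\Sigma A_i} \tag{6}$$

where $b_i$ and $b_j$ represent all possible pairs of the regression coefficients,
and $A_i$ and $A_j$ all possible pairs of the corresponding sums of squares of $X$.

The comparison of $v_1$ and $v_2$ by means of the F test furnishes, therefore,
the required test of the heterogeneity of the regression coefficients.

$$
\begin{array}{lcccccccc}
\text{Group} & \text{DF} & \Sigma(x^2) & \Sigma(xy) & \Sigma(y^2) & b_{yx} & b_{yx}\Sigma(xy) & \Sigma(y_d^2) & \text{DF} \\
1 & q & A_1 & B_1 & C_1 & b_1 = B_1/A_1 & b_1B_1 & C_1 - b_1B_1 & q - 1 \\
2 & q & A_2 & B_2 & C_2 & b_2 = B_2/A_2 & b_2B_2 & C_2 - b_2B_2 & q - 1 \\
\vdots & & & & & & & & \\
p & q & A_p & B_p & C_p & b_p = B_p/A_p & b_pB_p & C_p - b_pB_p & q - 1 \\
\text{Total} & pq & A_t & B_t & C_t & b_t = B_t/A_t & b_tB_t & C_t - b_tB_t & pq - 1
\end{array}
$$

Test of Significance

$$
\begin{array}{lccl}
 & \text{DF} & \Sigma(y_d^2) & \text{MS} \\
\text{Total} & pq - 1 & C_t - b_tB_t & \\
\text{Within groups} & p(q - 1) & \Sigma(C_i - b_iB_i) & v_1 \text{ (error mean square)} \\
\text{Difference} & p - 1 & \Sigma(b_iB_i) - b_tB_t & v_2 \text{ (due to differences between regression coefficients)}
\end{array}
$$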

4. Selection of variable for error control. A certain amount of care is
necessary in selecting variables that are to be used in the control of error. 
A simple rule is not to choose any variable that might itself be affected 
by the treatments because with such a variable the exercise of control 
results in the removal of part of the treatment effects we wish to measure, 


also be a convenient variable for error control, and possibly for more 
than one year, but this variable would not be suitable after the time when 
the treatments could affect the size. Similarly, with all error control 


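5. Calculation of sums of products. The sums of products for the
various components in an analysis of covariance can be calculated by a
procedure exactly analogous to that for calculating sums of squares.
Assume a set of values for $n$ treatments in $r$ replicates, as in the following
table.

$$
\begin{array}{c|cccc|c}
 & \multicolumn{4}{c|}{\text{Treatment}} & \text{Replicate} \\
\text{Replicate} & 1 & 2 & \cdots & n & \text{Total} \\
\hline
1 & X_{11}\ Y_{11} & X_{12}\ Y_{12} & \cdots & X_{1n}\ Y_{1n} & \Sigma X_{r1}\ \ \Sigma Y_{r1} \\
2 & X_{21}\ Y_{21} & X_{22}\ Y_{22} & \cdots & X_{2n}\ Y_{2n} & \Sigma X_{r2}\ \ \Sigma Y_{r2} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
r & X_{r1}\ Y_{r1} & X_{r2}\ Y_{r2} & \cdots & X_{rn}\ Y_{rn} & \Sigma X_{rr}\ \ \Sigma Y_{rr} \\
\hline
\text{Treatment Total} & \Sigma X_{t1}\ \Sigma Y_{t1} & \Sigma X_{t2}\ \Sigma Y_{t2} & \cdots & \Sigma X_{tn}\ \Sigma Y_{tn} & G_x\ \ G_y\ \text{(Grand Total)}
\end{array}
$$

The sums of products are

$$
\begin{aligned}
\text{Total} \qquad & \sum_1^{nr}(xy) = \sum_1^{nr}(XY) - \frac{G_xG_y}{nr} \\
\text{Treatments} \qquad & r\sum_1^{n}(x_ty_t) = \sum_1^{n}\frac{(\Sigma X_t\,\Sigma Y_t)}{r} - \frac{G_xG_y}{nr} \\
\text{Replicates} \qquad & n\sum_1^{r}(x_ry_r) = \sum_1^{r}\frac{(\Sigma X_r\,\Sigma Y_r)}{n} - \frac{G_xG_y}{nr} \\
\text{Error} \qquad & = \text{Total} - \text{Treatments} - \text{Replicates}
\end{aligned}
$$

where $x_t = \bar{X}_t - \bar{X}$, $y_t = \bar{Y}_t - \bar{Y}$, $x_r = \bar{X}_r - \bar{X}$, $y_r = \bar{Y}_r - \bar{Y}$.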
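The procedure can be sketched for a replicate-by-treatment table of paired values. The function name and the dictionary keys are hypothetical. The data are those of Table 9-1 (Example 9-1); the pair for ration 4 in the first replicate is taken as (34, 25), which is an assumption derived from the replicate and ration totals of that table.

```python
def sums_of_products(data):
    """Partition the sum of products of X and Y for a replicate x treatment table.

    data[i][j] is the (X, Y) pair for replicate i and treatment j.
    """
    r = len(data)       # number of replicates
    n = len(data[0])    # number of treatments
    g_x = sum(x for row in data for x, _ in row)
    g_y = sum(y for row in data for _, y in row)
    correction = g_x * g_y / (n * r)

    total = sum(x * y for row in data for x, y in row) - correction
    replicates = sum(
        sum(x for x, _ in row) * sum(y for _, y in row) for row in data
    ) / n - correction
    treatments = sum(
        sum(row[j][0] for row in data) * sum(row[j][1] for row in data)
        for j in range(n)
    ) / r - correction
    error = total - treatments - replicates
    return {"total": total, "replicates": replicates,
            "treatments": treatments, "error": error}


# Rows are replicates, columns are rations 1 to 4; each cell is (X, Y).
table_9_1 = [
    [(28, 21), (22, 17), (27, 18), (34, 25)],
    [(24, 21), (22, 16), (26, 20), (21, 15)],
    [(20, 16), (18, 16), (19, 14), (21, 16)],
    [(25, 21), (23, 17), (31, 19), (27, 21)],
]

products = sums_of_products(table_9_1)
for name, value in products.items():
    print(name, round(value, 2))
```

Replacing each product `x * y` by a square of `x` (or of `y`) in the same function gives the corresponding sums of squares, which is the sense in which the two procedures are exactly analogous.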

6. Example 9-1. Analysis of covariance. This one example will 
demonstrate some of the important applications of the covariance analysis. 
Data are taken from Bartlett [1]. The data were presented originally in 
a demonstration of the application of the covariance technique to the 
estimation of a missing value. The results obtained are given in Table 
9-1. In Bartlett’s analysis it was shown that the results in replicate E 
for ration 3 were erratic, and estimated values were substituted. Here 
the analysis is performed on the original values. 

In this experiment cows were tested during a control period in order to 
obtain their potential milk yield. The results for this control period 
provided a basis for error control through the analysis of covariance. 

The analysis of covariance is set up in Table 9-2. It is recommended
that sums of squares, products, and totals be obtained by treatments as
they will be required in this form for a test of heterogeneity.

The calculation of sums of products is as follows.

$$
\begin{array}{lll}
\text{Total} & \text{Ration 1} & 1937 \\
 & \text{Ration 2} & 1405 \\
 & \text{Ration 3} & 1861 \\
 & \text{Ration 4} & 2068 \\
 & & 7271 - \dfrac{388 \times 293}{16} = 165.75 \\[2ex]
\text{Replicates} & \multicolumn{2}{l}{\dfrac{(111 \times 81) + (93 \times 72) + \cdots}{4} - \dfrac{388 \times 293}{16} = 92.50} \\[2ex]
\text{Ration} & \multicolumn{2}{l}{\dfrac{(97 \times 79) + (85 \times 66) + \cdots}{4} - \dfrac{388 \times 293}{16} = 24.00} \\[2ex]
\text{Error} & \multicolumn{2}{l}{165.75 - 92.50 - 24.00 = 49.25}
\end{array}
$$

After also calculating the sums of squares, the analysis of covariance
can be set up as in Table 9-2.


TABLE 9-1

MEAN YIELD OF MILK IN (POUNDS PER WEEK)/10 FOR CONTROL AND
EXPERIMENTAL PERIODS, DAIRY COW FEEDING TRIAL. CONTROL PERIOD
YIELD = X, EXPERIMENTAL PERIOD YIELD = Y

                                    Ration
  Replicate        1           2           3           4          Total
                 X    Y      X    Y      X    Y      X    Y      X     Y
      B         28   21     22   17     27   18     34   25    111    81
      C         24   21     22   16     26   20     21   15     93    72
      D         20   16     18   16     19   14     21   16     78    62
      E         25   21     23   17     31   19     27   21    106    78
    Total       97   79     85   66    103   71    103   77    388   293


TABLE 9-2

ANALYSIS OF COVARIANCE, DATA OF TABLE 9-1

                        DF     Σx²      Σxy      Σy²      b(yx)    b(yx)Σxy   Σy_d²    DF
  Replicates             3    163.50    92.50    52.69
  Treatments             3     54.00    24.00    26.19   0.44444     10.67    15.52     2
  Error                  9     73.50    49.25    52.56   0.67007     33.00    19.56     8
  Treatments + Error    12    127.50    73.25    78.75   0.57451     42.08    36.67    11

Notice that the last line is obtained by adding the sums of squares and
products for treatments and error. The column headed $\Sigma y_d^2$ is obtained
by subtracting the previous column (sum of squares for regression) from
the sum of squares for $Y$.

The next step in the analysis should be to determine the significance 
of the regression in the error line. This must be significant in order to 
make adjustments worth while. We have the simple analysis: 


                  SS       DF      MS        F
  Regression     33.00      1     33.00     13.5
  Error          19.56      8      2.445


Since F is much greater than the 1% point, we are assured of the signifi- 
cance of the regression of experimental yield on control yield, and 
adjustments to the treatment means should be worth while. 

We still have to test the significance of differences between treatment 
means. The test is carried out in Table 9-3. 


TABLE 9-3 


TEST OF SIGNIFICANCE OF TREATMENT EFFECTS 


                                 DF      SS       MS        F      5% Point
  Treatments + Error             11     36.67
  Error                           8     19.56     2.445
  Difference = Treatments         3     17.11     5.703    2.33      4.07
  Treatments (adjusted by
    treatment regression)         2     15.52
  b1 - b2                         1      1.59     1.59

The reduced sum of squares for treatments + error is brought down 
from Table 9-2, together with the reduced sum of squares for error. The 
difference is a sum of squares for treatments that can be tested against 
the reduced error. The further step carried out in the last two lines is to 
obtain a test of the significance of the difference between the treatment 
and error regressions. 
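The arithmetic of Tables 9-2 and 9-3 can be retraced with a short sketch. The helper name `reduce_line` is hypothetical; the inputs are the treatment and error lines of Table 9-2 as printed, so results agree with the tables only to the rounding used there.

```python
def reduce_line(a, b, c, df):
    """Return (coefficient, regression SS, reduced SS of Y, reduced DF) for one line."""
    coefficient = b / a
    regression_ss = coefficient * b
    return coefficient, regression_ss, c - regression_ss, df - 1


# (sum x^2, sum xy, sum y^2, DF) from Table 9-2.
treatments = (54.00, 24.00, 26.19, 3)
error = (73.50, 49.25, 52.56, 9)
treatments_plus_error = tuple(t + e for t, e in zip(treatments, error))

b_t, reg_t, red_t, df_t = reduce_line(*treatments)
b_e, reg_e, red_e, df_e = reduce_line(*error)
b_te, reg_te, red_te, df_te = reduce_line(*treatments_plus_error)

error_ms = red_e / df_e
f_regression = reg_e / error_ms                  # significance of error regression

adjusted_treatment_ss = red_te - red_e           # "Difference = Treatments"
adjusted_treatment_df = df_te - df_e
f_treatments = (adjusted_treatment_ss / adjusted_treatment_df) / error_ms

slope_difference_ss = adjusted_treatment_ss - red_t   # 1 DF, b1 - b2

print(round(b_e, 5), round(f_regression, 1))
print(round(adjusted_treatment_ss, 2), round(f_treatments, 2))
print(round(slope_difference_ss, 2))
```

Each reduced sum of squares loses one degree of freedom for the coefficient fitted in its own line, which is why the treatments + error line has 11 and the error line 8.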

For testing the significance of differences between pairs of adjusted 
treatment means we can apply formula (5), suggested by Finney [4]. 
Here the variance of a difference between two means is taken as

$$\frac{2 \times 2.445}{4}\left(1 + \frac{54.00}{73.50 \times 3}\right) = 1.522$$

and the standard error of a mean difference = $\sqrt{1.522} = 1.23$.
The analysis of this experiment brings out some interesting points. In
the first place, the reduction in the error from adjustment by the milk
yields in the control period is quite marked. The relative precision of
the results before and after adjustment is approximately

$$\frac{52.56}{9} \div \frac{19.56}{8} = 2.39$$

indicating a very decided gain in information.

The F test of the ration effects before adjustment yields an F value of 
1.49 whereas after adjustment the F value is 2.33. Neither of these is 
equal to the 5% point, but there is sufficient improvement after adjustment
to indicate that the method is sound. Owing to the large gain in precision 
one might expect a greater effect on the significance of the rations, but in 
this connection it must be remembered that a portion of the ration effects 
in the unadjusted data must correspond with true ration effects and this 
is removed by the adjustment. 

7. Example 9-2. Testing for heterogeneity of regression. If the sums 
of squares and products are determined for each ration, an analysis such 
as that of Table 9-4 can be set up. 


TABLE 9-4 


TEST OF HETEROGENEITY OF REGRESSION BETWEEN TREATMENTS

Ration   DF   $\sum x^2$   $\sum xy$   $\sum y^2$     $b$     $b\sum xy$   $\sum y'^2$   DF
  1       3     32.75       21.25       18.75      0.6488     13.79         4.96        2
  2       3     14.75        2.50        1.00      0.1695      0.42         0.58        2
  3       3     74.75       32.75       20.75      0.4381     14.35         6.40        2
  4       3    114.75       85.25       64.75      0.7429     63.33         1.42        2
                                              Sum for rations              13.36        8

Total    12    237.00      141.75      105.25      0.5981     84.78        20.47       11


It will be noted that although the total reduced sum of squares
accounts for 11 degrees of freedom, the reduced sums of squares for
each ration account for only 4 x 2 = 8 degrees of freedom. The
remaining 3 degrees of freedom represent differences between the 4
regression coefficients. This enables us to set up the following analysis.


              DF     SS       MS       F
Total         11    20.47
Rations        8    13.36    1.670
Difference     3     7.11    2.370    1.42


There is very little evidence of significant heterogeneity. 
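
The heterogeneity test can be reproduced from the sums of squares and products in Table 9-4. This is a minimal Python sketch under the assumption that the table values are as printed; the helper name `reduced_ss` is hypothetical.

```python
# (sum x^2, sum xy, sum y^2) for each ration, read from Table 9-4.
rations = [
    (32.75, 21.25, 18.75),
    (14.75, 2.50, 1.00),
    (74.75, 32.75, 20.75),
    (114.75, 85.25, 64.75),
]

def reduced_ss(sxx, sxy, syy):
    """Sum of squares of y remaining after fitting one regression on x."""
    return syy - sxy ** 2 / sxx

# A separate regression within each ration leaves 2 DF per ration.
within_ss = sum(reduced_ss(*r) for r in rations)
within_df = 2 * len(rations)

# One common regression fitted to the pooled sums leaves 12 - 1 = 11 DF.
pooled = [sum(col) for col in zip(*rations)]
total_ss = reduced_ss(*pooled)
total_df = 11

# The difference measures variation among the 4 regression coefficients.
diff_ss = total_ss - within_ss
diff_df = total_df - within_df
f_heterogeneity = (diff_ss / diff_df) / (within_ss / within_df)

print(round(within_ss, 2), round(total_ss, 2), round(f_heterogeneity, 2))
```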
8. Exercises


1. The data in Table 9-5 are grain and straw yields given by Eden and Fisher [3]
for 8 manurial treatments and 8 replicates of each. Take the straw weight as a con- 
comitant variable in an analysis of covariance and test the significance of the adjusted 
grain yields. From the results obtained and from other characteristics of the data 
decide as to the suitability of straw yield for the adjustment of grain yields. 


TABLE 9-5 


THE MEAN GRAIN AND STRAW YIELDS FOR EACH OF 64 PLOTS, THERE BEING 
8 DIFFERENT MANURIAL TREATMENTS AND 8 REPLICATES OF EACH 


               Block 1         Block 2         Block 3         Block 4
Treatments   Straw  Grain    Straw  Grain    Straw  Grain    Straw  Grain
A             242    620      321    646      261    681      317    644
B             267    644      382    745      201    542      316    711
C             215    523      330    713      298    686      381    688
D             212    601      292    693      265    685      255    714
E             322    664      370    693      284    666      323    516
F             200    514      261    637      259    697      361    710
G             260    550      318    708      266    663      340    673
H             203    521      275    661      207    594      331    730

               Block 5         Block 6         Block 7         Block 8
Treatments   Straw  Grain    Straw  Grain    Straw  Grain    Straw  Grain
A             255    706      331    615      216    552      295    726
B             280    705      285    637      200    543      309    646
C             300    692      294    612      256    635      284    748
D             238    699      309    697      283    701      324    746
E             232    656      393    663      351    657      363    683
F             234    633      258    595      306    697      376    712
G             362    671      400    626      276    655      385    671
H             229    625      266    644      276    745      328    747

CHAPTER 10


Non-Linear Regression 


1. Analysis of linear and non-linear regression components. In
Chapter 5, Section 12, it was pointed out that sums of squares representing
a set of means could be broken up into sums of squares representing
individual degrees of freedom. By following a simple rule it was shown
that, having decided on a given comparison, other comparisons could be
selected such that all comparisons would be orthogonal. A situation of
particular interest arises when we require a comparison that represents
what is known as the linear effect. Suppose, for example, that we have
the following results, representing a response to four levels of treatment.


Levels (X)        1       2       3       4
Response (Y)     9.4    12.6     8.7     6.3


The results show that there is an increase in response up to the second 
level and then a gradual decrease. The linear response will be simply 
that portion of the total that can be represented by a straight-line regression 
equation of the form $Y = a + bX$. From our knowledge of regression
methods it is clear that the equation to be fitted is expressed in the form

$$Y = (\bar{Y} - b\bar{X}) + bX$$

where b is the regression coefficient and is given by

$$b = \frac{\sum xy}{\sum x^2}$$

where $x = X - \bar{X}$ and $y = Y - \bar{Y}$. We shall make this calculation and
show later how it may be done by a much easier method.

$$\sum xy = 85.9 - \frac{10.0 \times 37.0}{4} = -6.6, \qquad \sum x^2 = 30 - \frac{(10.0)^2}{4} = 5.0$$

Then the required sum of squares for linear regression is

$$\frac{(\sum xy)^2}{\sum x^2} = \frac{(-6.6)^2}{5.0} = 8.71$$

The total sum of squares for Y is of course

$$9.4^2 + 12.6^2 + 8.7^2 + 6.3^2 - \frac{(37.0)^2}{4} = 20.25$$


An analysis can then be made as follows.

                               SS      DF     MS
Linear regression              8.71     1    8.71
Deviations from regression    11.54     2    5.77
Total                         20.25     3


The sum of squares for linear regression does not take up a very large 
proportion of the total. This is to be expected, owing to the rise and 
fall in the response which is definitely not linear. This result emphasizes 
the need for fitting by means of a more complicated function such as 
$Y = a + bX + cX^2$, which is known as a quadratic and will give a
parabola. This curve can be fitted by a procedure similar to that for the
equation $Y = a + bX$, but the method is rather complicated and a much
easier one is available by the application of tables given by Fisher and
Yates [3]. Actually with these tables we calculate the sums of squares
directly and do not go as far as determining the equation unless it is
required for a particular purpose. Furthermore, by means of the tables,
it is quite easy to determine the sums of squares for equations of higher
degree such as the cubic $Y = a + bX + cX^2 + dX^3$.

The section from Fisher and Yates tables required for our problem is 
reproduced below together with the values of Y. 


Response (Y)     9.4    12.6     8.7     6.3
Linear            -3      -1      +1      +3
Quadratic         +1      -1      -1      +1
Cubic             -1      +3      -3      +1


The required sums of squares are then calculated as follows.

Linear response

$$\frac{(-3 \times 9.4 - 1 \times 12.6 + 1 \times 8.7 + 3 \times 6.3)^2}{20} = 8.71$$

Quadratic response

$$\frac{(1 \times 9.4 - 1 \times 12.6 - 1 \times 8.7 + 1 \times 6.3)^2}{4} = 7.84$$

Cubic response

$$\frac{(-1 \times 9.4 + 3 \times 12.6 - 3 \times 8.7 + 1 \times 6.3)^2}{20} = 3.70$$


where the divisor for each response is the sum of the squares of the
coefficients from Fisher and Yates' tables. Thus the first divisor is
$3^2 + 1^2 + 1^2 + 3^2 = 20$.

Adding the sums of squares gives a total of 20.25, showing that all the 
components have been calculated. It is important to note here that the 
quadratic response does not represent the sum of squares due to fitting a 
quadratic of the form $Y = a + bX + cX^2$. It represents merely the
additional sum of squares obtained on fitting the quadratic over and 
above that for fitting the linear equation. Actually the total sum of 
squares that can be accounted for by the quadratic is 8.71 + 7.84 = 16.55. 
Similarly the sum of squares 3.70 represents the additional sum of squares 
due to fitting the cubic. It follows, too, from this that each sum of 
squares represents 1 degree of freedom. 

The complete analysis is: 

                        SS      DF     MS
Linear response         8.71     1    8.71
Quadratic response      7.84     1    7.84
Cubic response          3.70     1    3.70


If each determination had been made in duplicate, there would have 
been an additional sum of squares for 3 degrees of freedom that might be
suitable as an error for testing the significance of the three effects. 

It should also be noted that the effects determined are orthogonal. 
This is true because the sum of the coefficients in each line is zero and 
the sum of the products of the corresponding coefficients for any two lines 
is zero.
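
The partition into linear, quadratic, and cubic responses, together with the orthogonality conditions just stated, can be checked numerically. The lines below are a minimal Python sketch, not part of the original text; the function name `contrast_ss` is an illustrative assumption, and the coefficients are those quoted from Fisher and Yates for four equally spaced levels.

```python
responses = [9.4, 12.6, 8.7, 6.3]          # Y at levels 1 to 4
contrasts = {
    "linear":    [-3, -1, 1, 3],
    "quadratic": [1, -1, -1, 1],
    "cubic":     [-1, 3, -3, 1],
}

def contrast_ss(coeffs, values):
    """(sum of c*Y)^2 divided by the sum of c^2: one degree of freedom."""
    effect = sum(c * y for c, y in zip(coeffs, values))
    return effect ** 2 / sum(c * c for c in coeffs)

ss = {name: contrast_ss(c, responses) for name, c in contrasts.items()}

# Total sum of squares of Y about its mean, for comparison.
mean = sum(responses) / len(responses)
total_ss = sum((y - mean) ** 2 for y in responses)

# Orthogonality: each line sums to zero and every pair has zero cross-product.
names = list(contrasts)
orthogonal = all(sum(contrasts[n]) == 0 for n in names) and all(
    sum(a * b for a, b in zip(contrasts[p], contrasts[q])) == 0
    for i, p in enumerate(names) for q in names[i + 1:]
)

print({k: round(v, 2) for k, v in ss.items()}, round(total_ss, 2), orthogonal)
```

Because the three comparisons are orthogonal and exhaust the 3 degrees of freedom, their sums of squares add to the total.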

2. Analysis of regression components in an actual example. In an 
experiment on the effect of adding different levels of bromate in baking 
loaves from a series of flours from 17 varieties of wheat, Larmour [5] 
obtained the results given in Table 10-1. The data are loaf volumes in 
cubic centimeters coded by dividing by 100 and rounding off to one 
decimal place. 
TABLE 10-1


LOAF VOLUMES FOR FLOUR OF 17 VARIETIES OF WHEAT, TESTED AT
4 LEVELS OF POTASSIUM BROMATE ADDED IN BAKING


               Level of Bromate          Variety      Regression Component
Variety      1       2       3       4    Total     Linear   Quadratic   Cubic
   1       10.8    10.6     9.8     8.8    40.0      -6.8      -0.8       0.4
   2        9.8     9.6     8.6     8.2    36.2      -5.8      -0.2       1.4
   3        8.5     8.2     7.7     7.4    31.8      -3.8       0         0.4
   4        8.2     7.6     7.2     7.0    30.0      -4.0       0.4       0
   5       10.4    10.6     9.8     9.4    40.2      -3.8      -0.6       1.4
   6        9.6     9.8     9.2     8.4    37.0      -4.2      -1.0       0.6
   7        9.0     9.0     8.8     7.8    34.6      -3.8      -1.0      -0.6
   8        8.6     8.7     8.5     8.5    34.3      -0.5      -0.1       0.5
   9        9.4    10.0     9.6     9.6    38.6       0.2      -0.6       1.4
  10       10.0    10.2     9.6     9.0    38.8      -3.6      -0.8       0.8
  11        9.4     9.6     9.5     9.2    37.7      -0.7      -0.5       0.1
  12        8.4     8.7     8.8     8.8    34.7       1.3      -0.3       0.1
  13        6.6     6.5     6.8     6.6    26.5       0.3      -0.1      -0.9
  14        9.1     8.9     8.4     7.8    34.2      -4.4      -0.4       0.2
  15       10.8    10.7    10.2    10.0    41.7      -2.9      -0.1       0.7
  16        7.5     7.4     7.2     7.2    29.3      -1.1       0.1       0.3
  17        8.4     8.6     8.2     8.2    33.4      -1.0      -0.2       1.0

Total     154.5   154.7   147.9   141.9   599.0     -44.6      -6.2       7.8

TABLE 10-2

ANALYSIS OF VARIANCE, DATA OF TABLE 10-1

                 SS       DF     MS        F     5% Point
Varieties      70.0097    16    4.376     40.1     1.86
Treatments      6.5947     3    2.198     20.2     2.80
Interaction     5.2303    48    0.1090


The simple analysis of variance for these data is given in Table 10-2.
The interaction is taken as an error, since there was only one sample of
flour of each variety at each level. Note that the effect of bromate is
quite significant, but on examining the treatment totals we find that there
is a slight increase from the first to the second level and a decrease from
the second to the third and from the third to the fourth. This suggests
a quadratic effect, so it is of interest to determine the linear, quadratic, and
cubic components of the regression of the mean loaf volumes on bromate
levels. The calculations are:
Linear component

$$\frac{(-3 \times 154.5 - 1 \times 154.7 + 1 \times 147.9 + 3 \times 141.9)^2}{17 \times 20} = 5.8505$$

Quadratic component

$$\frac{(1 \times 154.5 - 1 \times 154.7 - 1 \times 147.9 + 1 \times 141.9)^2}{17 \times 4} = 0.5653$$

Cubic component

$$\frac{(-1 \times 154.5 + 3 \times 154.7 - 3 \times 147.9 + 1 \times 141.9)^2}{17 \times 20} = 0.1789$$

Total = 6.5947


Notice that when totals enter into these calculations the divisor is multi- 
plied by the number of observations in each total. 


TABLE 10-3 


ANALYSIS OF TREATMENT EFFECTS INTO LINEAR, QUADRATIC, 
AND CUBIC COMPONENTS 


                         SS       DF     MS        F     5% Point
Linear component        5.8505     1    5.850     53.7     4.04
Quadratic component     0.5653     1    0.5653     5.19    4.04
Cubic component         0.1789     1    0.1789     1.64    4.04
Interaction             5.2303    48    0.1090


A more complete analysis can now be given as in Table 10-3. The
linear and quadratic effects are significant, but the cubic effect is not. 
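
The components of Table 10-3 follow from the four level totals alone. The sketch below, in Python and not part of the original text, applies the same coefficients to the totals and multiplies each divisor by 17, the number of observations in each total; variable names are illustrative.

```python
level_totals = [154.5, 154.7, 147.9, 141.9]   # bromate level totals, Table 10-1
varieties = 17                                 # observations in each total
interaction_ms = 5.2303 / 48                   # error mean square, Table 10-2

contrasts = {
    "linear":    [-3, -1, 1, 3],
    "quadratic": [1, -1, -1, 1],
    "cubic":     [-1, 3, -3, 1],
}

component_ss = {}
f_ratios = {}
for name, coeffs in contrasts.items():
    effect = sum(c * t for c, t in zip(coeffs, level_totals))
    # Divisor is (sum of squared coefficients) x (observations per total).
    component_ss[name] = effect ** 2 / (varieties * sum(c * c for c in coeffs))
    f_ratios[name] = component_ss[name] / interaction_ms

for name in contrasts:
    print(f"{name:10s} SS = {component_ss[name]:.4f}  F = {f_ratios[name]:.2f}")
```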

Further information can be obtained from the data by an examination 
of the consistency of the linear and quadratic effects for the different 
varieties. In other words, we require for this study the interaction of the 
linear effect with varieties and of the quadratic effect with varieties. The 
interaction components are calculated from the figures given in columns 
7 and 8 of Table 10-1. These figures are obtained by calculating linear, 
quadratic, and cubic components in each line. Thus, 


$$-6.8 = -3 \times 10.8 - 1 \times 10.6 + 1 \times 9.8 + 3 \times 8.8$$
$$-0.8 = 1 \times 10.8 - 1 \times 10.6 - 1 \times 9.8 + 1 \times 8.8$$
$$0.4 = -1 \times 10.8 + 3 \times 10.6 - 3 \times 9.8 + 1 \times 8.8$$

etc.

Then the sums of squares are:


Linear effect x Varieties

$$\frac{6.8^2 + 5.8^2 + \cdots + 1.0^2}{20} - 5.8505 = 4.2665$$

Quadratic effect x Varieties

$$\frac{0.8^2 + 0.2^2 + \cdots + 0.2^2}{4} - 0.5653 = 0.6297$$


The complete partitioning of the sum of squares and degrees of freedom 
is now: 


                               SS       DF
Varieties                    70.0097    16
Treatments
    Linear component          5.8505     1
    Quadratic component       0.5653     1
    Cubic component           0.1789     1
Interactions
    Linear x Varieties        4.2665    16
    Quadratic x Varieties     0.6297    16
    Cubic x Varieties         0.3341    16
Total                        81.8347    67


Lacking an experimental error determined from variation within true 
replicates of the experiment, tests of significance of the interactions must 
be made against residual variation that would seem to be appropriate. 
For example, the linear x varieties component might be tested against an 
error made up from adding the sums of squares and degrees of freedom 
in the last two lines. The quadratic interaction effect could only be 
tested against the cubic interaction effect. 
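
The interaction components can be recomputed from the individual loaf volumes. The following Python sketch is not part of the original text; the volumes are those of Table 10-1 as read from this copy (the fourth entry for variety 16 is taken as 7.2, the value consistent with its printed total of 29.3), and the variable names are illustrative.

```python
# Loaf volumes for 17 varieties at 4 bromate levels, read from Table 10-1.
volumes = [
    [10.8, 10.6, 9.8, 8.8], [9.8, 9.6, 8.6, 8.2], [8.5, 8.2, 7.7, 7.4],
    [8.2, 7.6, 7.2, 7.0], [10.4, 10.6, 9.8, 9.4], [9.6, 9.8, 9.2, 8.4],
    [9.0, 9.0, 8.8, 7.8], [8.6, 8.7, 8.5, 8.5], [9.4, 10.0, 9.6, 9.6],
    [10.0, 10.2, 9.6, 9.0], [9.4, 9.6, 9.5, 9.2], [8.4, 8.7, 8.8, 8.8],
    [6.6, 6.5, 6.8, 6.6], [9.1, 8.9, 8.4, 7.8], [10.8, 10.7, 10.2, 10.0],
    [7.5, 7.4, 7.2, 7.2], [8.4, 8.6, 8.2, 8.2],
]
contrasts = {
    "linear":    [-3, -1, 1, 3],
    "quadratic": [1, -1, -1, 1],
    "cubic":     [-1, 3, -3, 1],
}

interaction_ss = {}
for name, coeffs in contrasts.items():
    divisor = sum(c * c for c in coeffs)
    # Regression component for each variety (the right-hand columns of the table).
    effects = [sum(c * v for c, v in zip(coeffs, row)) for row in volumes]
    # Main component from the summed effects: 17 observations per total.
    main_ss = sum(effects) ** 2 / (len(volumes) * divisor)
    # Variation of the effect from variety to variety: 16 degrees of freedom.
    interaction_ss[name] = sum(e * e for e in effects) / divisor - main_ss

print({k: round(v, 4) for k, v in interaction_ss.items()})
```

The three interaction components add to the interaction sum of squares of Table 10-2, which is a useful check on the partition.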

3. Basic factors in non-linear regression. In Sections 1 and 2 the
student is introduced to non-linear regression as a simple extension of the 
methods of the analysis of variance applied to experimental data. The 
remainder of the chapter is devoted to the general problems and methods 
of fitting curves to data having non-linear characteristics. In Chapter 6 
we studied methods of fitting a straight line to data representing pairs of 
values of X and Y, where Y was taken to be the dependent variable. The 
frequency with which essentially linear relations are encountered is the 
reason for the emphasis on the linear equation, but as our experience is 
broadened through a study of data of various kinds it is soon realized 
that in a great many examples a straight line is inadequate and that if a
mathematical function is to be fitted it must be somewhat more complicated 
than the simple linear form. It should be realized, of course, that the 
variables studied show, in general, variation due to two sources. One 
source is in the actual measurement, and the other source is “organic,” 
or, in other words, due to a multitude of influential factors varying in 
number and intensity. With such variables any tendency towards a 
slightly curvilinear relation would not be brought out except by the study 
of very large numbers. A second point to keep in mind is that usually 
we are dealing with the behavior of two variables over a limited range of
variations. For example, in a study of the relation between temperature 
and crop yield, under conditions in nature we have actually a very limited 
range of temperature. Suppose that there is a positive regression of 
yield on temperature under these conditions, and then we decide to set 
up a similar type of experiment under artificial conditions. Obviously if 
the temperature is increased beyond a certain point the yield will be 
decreased, and the complete result will be represented by a curve that 
rises at first in a straight line and then begins to curve downward, falling 
finally to the zero axis. In fertilizer experiments it is a common experience 
to find that yield increases with the quantity of fertilizer applied, but such 
a relation has a very obvious limit. At a certain point there will be no 
further increases in yield, and eventually if enough fertilizer is applied 
the effect is toxic and yields decrease. There are also for biological 
variables certain relations that are essentially curved because of fundamental
curvilinear relations in the background. The growth curve is a
very familiar example. Consider the increase with time of the number of 
bacteria on a plate. Let the unit of time be the time required for a 
bacterium to divide and form 2 bacteria. Obviously the theoretical 
increase, starting with a single unit, is represented by 1, 2, 4, 8, 16, 32,
..., and so forth, until conditions become crowded and the rate is slowed
down. The equation is $Y = 2^X$ where Y is the number of bacteria and
X is the number of units of time. In general such curves are represented
by equations of the type $Y = ab^X$ or $Y = ae^X$, where a and b are constants
and e = approximately 2.718 28 and is the base of the Napierian
system of natural logarithms. These are referred to as exponential
functions.

In general the functions that can be used for fitting can be divided into
two groups: (1) algebraic, such as $Y = a + bX + cX^2$, and (2) exponential,
such as $Y = ce^X$, where the variable X appears as an exponent.
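
One practical consequence of the exponential form $Y = ab^X$ is that log Y is a linear function of X, namely log Y = log a + X log b. The following Python lines are a small illustrative sketch, not part of the original text, using the theoretical bacterial counts 1, 2, 4, ... and an ordinary least-squares line on the logarithms; the variable names are assumptions.

```python
import math

# Theoretical bacterial counts Y = 2**X for X = 0, 1, ..., 5 units of time.
xs = list(range(6))
ys = [2 ** x for x in xs]

# Taking logarithms turns Y = a * b**X into log Y = log a + X * log b.
log_ys = [math.log(y) for y in ys]

# Ordinary least-squares straight line through the points (X, log Y).
n = len(xs)
mean_x = sum(xs) / n
mean_log_y = sum(log_ys) / n
slope = sum((x - mean_x) * (ly - mean_log_y) for x, ly in zip(xs, log_ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_log_y - slope * mean_x

a = math.exp(intercept)   # constant multiplier, here 1
b = math.exp(slope)       # growth factor per unit of time, here 2

print(round(a, 6), round(b, 6))
```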

4. Reasons for fitting curves. The type of data with which we are 
dealing, the object of the experiment, and other factors enter into the 
reasons for fitting curves. Jt is sometimes stated that the basic reason 
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underlying all regression studies is the setting up of a prediction equation, 
but this is not always the important feature in the study of biological 
data and has perhaps been overemphasized. When faced with the 
analysis of a set of pairs of values of two variables, there is a definite 
necessity for the reduction of these data to statistics that summarize the 
results and to which tests of significance can be applied. This in itself is 
a sufficient reason for fitting a curve. 

The need for statistical reduction and the making of tests of significance 
simplifies also the problem of selecting the type of equation to be fitted. 
All we require is something that is reasonable and gives a good fit. There
may be several equations that will give a good fit, and of these we should 
select the simplest and the one that can be most easily fitted. 

In this reasoning there is a fairly clear-cut distinction between the two 
types of problems that arise: in the first place, when there is a definite 
mathematical function in the background based on some fundamental 
law, and, in the second place, when the object is merely to reduce the 
data to a simpler form. The first problem arises frequently in experiments
in physics or chemistry where the assumption frequently is that all varia- 
tions from some mathematical law are due to errors of measurement. 
Success in such an experiment lies not so much in obtaining a curve that 
gives a good fit but in obtaining a good fit to an equation that expresses 
a fundamental law. 

5. Selecting a regression equation where the trend is non-linear. This is
not always a simple problem as there is a very large number of equations 
from which to choose, and for each we must take into consideration the 
work involved, the goodness of fit, and whether or not valid tests of 
significance can be made. These are of two general types: (1) poly- 
nomials and (2) exponentials. The exponentials are often referred to as 
logarithmic curves because they are best transformed to logarithms before 
fitting. Typical examples are: 


Polynomials                                    Logarithmic
$Y = a + bX$ (linear)                          $Y = a + b \log X$
$Y = a + bX + cX^2$ (quadratic)                $\log Y = a + bX$
$Y = a + bX + cX^2 + dX^3$ (cubic)             $\log Y = a + b \log X$
etc., with a number of variations              etc.


Reference to Figure 10-1 will give some idea as to the types of data to 
which polynomial curves will give the best fit. Note the general shape of 
polynomials of different degree. The curves of different degree are 
distinguished by the number of inflection points. The first inflection 
point comes with the cubic, and there is one additional point for each
degree of fitting. Data requiring a curve with several twists must be 
fitted with a polynomial of high degree. One of the desirable features of 
the polynomial equations is that they lend themselves readily to tests of 
goodness of fit for each degree of fitting. This simplifies the problem of 
selecting the curve required. 

Logarithmic curves are characterized by a flattening out at one end of
the range. This can be observed quite simply by plotting the logarithms
of the numbers 1, 2, 3, 4, 5, 6, .... It will be noticed that the interval
between the logarithms becomes less at each step and eventually is very
small. The characteristics of the above three types should be observed
by working out the values of Y for some imaginary equations and plotting
on graph paper. Specially ruled paper for plotting logarithmic curves
should also be studied to observe how logarithmic equations are transformed
to linear equations by plotting on the appropriate type of log
paper.

FIGURE 10-1. Graphs of 4 polynomial functions (straight line Y = 1 + X, quadratic, cubic, and quartic).

6. Fitting polynomial equations. If a polynomial of the form

$$Y = a + bX + cX^2 + dX^3 \tag{1}$$

is required, the problem resolves itself into one of determining the values
of a, b, c, and d that will make the sum of squares of the errors of
estimation a minimum. The expression to be minimized is

$$\sum (Y - a - bX - cX^2 - dX^3)^2 \tag{2}$$

and this leads to a set of four simultaneous equations as follows.

$$\begin{aligned}
an + b\sum X + c\sum X^2 + d\sum X^3 &= \sum Y \\
a\sum X + b\sum X^2 + c\sum X^3 + d\sum X^4 &= \sum XY \\
a\sum X^2 + b\sum X^3 + c\sum X^4 + d\sum X^5 &= \sum X^2Y \\
a\sum X^3 + b\sum X^4 + c\sum X^5 + d\sum X^6 &= \sum X^3Y
\end{aligned} \tag{3}$$


This set of equations can be expanded or contracted as desired, according 
to the degree of the polynomial required. If we let r represent the degree 
of fitting, there will be r + 1 equations and r 1 terms on the left, one 
for each coefficient to be determined. 

The simultaneous equations provide a general method of fitting polynomials
of any degree. The procedure is to determine $\sum Y$, $\sum XY$, $\sum X^2Y$,
etc., and the sums of the powers of X. The equations can be set up in
tabular form and solved by the method with which one is most familiar.
The Doolittle tabular method is given here, but some workers prefer a
direct solution by elimination and substitution which is equivalent to the
Doolittle method. Others prefer the solution by means of determinants
as in the Crout method described in the Appendix.

Before studying the details of the solution it should be noted that 
certain simplifications are possible. In the first place it is almost essential 
in order to avoid large numbers to code the values of X and Y to make 
them as small as possible. Starting with a table such as 10-4, to which 
a polynomial is to be fitted, it is desirable first to make unit class intervals 
and then to set up arbitrary origins for X and Y that are close to the 
actual means. 

With certain types of data there are other simplifications, and when 
possible there are definite advantages in fitting what are known as 
orthogonal polynomials, which are described briefly in the next section. 

7. Orthogonal polynomials. If we fit a polynomial of the form
$Y = a_1 + b_1X + c_1X^2$ to a given set of data and then fit another of the
form $Y = a_2 + b_2X + c_2X^2 + d_2X^3$, we shall find as indicated by the
notation that $a_1 \neq a_2$, $b_1 \neq b_2$, etc. Another way of looking at this is to
say that, after fitting to the second degree, the procedure of fitting to the
third degree is not merely a matter of adding a term in $X^3$ to the original
equation. In other words it is not possible to fit by successive stages,
testing the goodness of fit as we proceed, unless we go through the whole
procedure of solving the simultaneous equations at each stage.

With orthogonal polynomials the type of equation fitted is of the form

$$Y = A + B\xi_1 + C\xi_2 + D\xi_3 + \cdots \tag{4}$$

where the $\xi$'s are themselves orthogonal functions of X. For example,
we might have $\xi_1 = k_1 + X$, $\xi_2 = k_2 + k_3X + X^2$, etc., these functions
being determined in such a way as to make

$$\sum \xi_1 = 0, \quad \sum \xi_2 = 0, \quad \cdots, \quad \sum \xi_1\xi_2 = 0, \quad \sum \xi_1\xi_3 = 0, \quad \cdots \tag{5}$$

When X can be taken as a series of natural numbers and the Y's have
equal weight, the algebraic derivation of the values of the $\xi$'s is fairly
simple. Taking x in place of X, it can be shown that

$$\xi_1 = x, \qquad \xi_2 = x^2 - \frac{n^2 - 1}{12}, \qquad \text{etc.} \tag{6}$$


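The error to be minimized for an orthogonal polynomial of the second
degree is

$$\sum (Y - A - B\xi_1 - C\xi_2)^2$$

leading to the simultaneous equations

$$\begin{aligned}
nA + B\sum \xi_1 + C\sum \xi_2 &= \sum Y \\
A\sum \xi_1 + B\sum \xi_1^2 + C\sum \xi_1\xi_2 &= \sum \xi_1 Y \\
A\sum \xi_2 + B\sum \xi_1\xi_2 + C\sum \xi_2^2 &= \sum \xi_2 Y
\end{aligned} \tag{7}$$

and, since by definition $\sum \xi_1 = 0$, $\sum \xi_2 = 0$, $\sum \xi_1\xi_2 = 0$, these simplify to

$$nA = \sum Y, \qquad B\sum \xi_1^2 = \sum \xi_1 Y, \qquad C\sum \xi_2^2 = \sum \xi_2 Y \tag{8}$$

or

$$A = \frac{\sum Y}{n}, \qquad B = \frac{\sum \xi_1 Y}{\sum \xi_1^2}, \qquad C = \frac{\sum \xi_2 Y}{\sum \xi_2^2} \tag{9}$$

It will be obvious that on minimizing the error for an orthogonal polynomial
of the third degree we get the same results as above for A, B, and C, and in
addition we will have

$$D = \frac{\sum \xi_3 Y}{\sum \xi_3^2}$$

Since the sums of squares for regression are given by

$$\begin{aligned}
\text{Linear regression} &\quad B\sum \xi_1 Y \\
\text{Quadratic regression} &\quad B\sum \xi_1 Y + C\sum \xi_2 Y \\
\text{Cubic regression} &\quad B\sum \xi_1 Y + C\sum \xi_2 Y + D\sum \xi_3 Y
\end{aligned} \tag{10}$$

it follows that the sums of squares for regression for each degree of freedom
can be obtained at each stage of fitting.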
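
The independence of the coefficients can be seen on a small case. The Python sketch below is not part of the original text; it assumes the standard forms for equally spaced X, namely the first orthogonal function equal to x and the second equal to x squared less (n^2 - 1)/12, with x measured from the mean of X, and applies them to the four responses of Section 1. All names are illustrative.

```python
xs_raw = [1, 2, 3, 4]
ys = [9.4, 12.6, 8.7, 6.3]
n = len(xs_raw)

mean_x = sum(xs_raw) / n
xs = [X - mean_x for X in xs_raw]                 # x = X - mean of X

xi1 = xs                                           # first-degree function
xi2 = [x * x - (n * n - 1) / 12 for x in xs]       # second-degree function

def dot(u, v):
    """Sum of products of corresponding values."""
    return sum(a * b for a, b in zip(u, v))

# Each coefficient is found separately because the cross-product sums vanish.
A = sum(ys) / n
B = dot(xi1, ys) / dot(xi1, xi1)
C = dot(xi2, ys) / dot(xi2, xi2)

# Sums of squares for regression, one degree of freedom each.
ss_linear = B * dot(xi1, ys)
ss_quadratic = C * dot(xi2, ys)

print(round(B, 2), round(C, 2), round(ss_linear, 2), round(ss_quadratic, 2))
```

Adding the quadratic term leaves A and B untouched, which is the property that allows fitting by successive stages.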
8. Example 10-1. Fitting a polynomial to data where the values of Y
have unequal weight. In an example of this type it is not convenient to
fit orthogonal polynomials, so we proceed directly by obtaining the
values of $\sum X, \sum X^2, \cdots, \sum Y, \sum XY, \cdots$ and solving a system of normal
equations as shown in (3).

TABLE 10-4


DATA SET UP FOR CALCULATION OF A POLYNOMIAL REGRESSION EQUATION

[Table body (frequencies of coded Y within each coded X array) not legible in this copy.]

FIGURE 10-2. Graph of loaf volume means from Table 10-1.

Table 10-4 is the usual regression table, but, for the purpose of fitting,
the actual intervals of Y and X have been changed to unit intervals and
an arbitrary origin for Y has been taken at 6 and for X at 5. This is a
simple method of coding that reduces the labor of calculation. The first
step is to decide on the polynomial to be fitted, and accordingly a graph 
is made of the means of the Y arrays, Figure 10-2. It is obvious that a 
straight line will not give a good fit, but there is little indication that we 
shall need to go beyond the third degree. As a matter of fact a curve 
of the second degree should give a good fit, but we shall proceed to the 
fitting of a third-degree curve, and if the three degrees of fitting are not 
required it will be a simple matter to revert to the quadratic. 

Table 10-5 must be set up next in order to obtain the values required 
for the solution of the normal equations, as written out below. 


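$$\begin{aligned}
an + b\sum x + c\sum x^2 + d\sum x^3 &= \sum y \\
a\sum x + b\sum x^2 + c\sum x^3 + d\sum x^4 &= \sum xy \\
a\sum x^2 + b\sum x^3 + c\sum x^4 + d\sum x^5 &= \sum x^2y \\
a\sum x^3 + b\sum x^4 + c\sum x^5 + d\sum x^6 &= \sum x^3y
\end{aligned}$$

where $x = X - 5$, and $y = Y - 6$.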


TABLE 10-5 


CALCULATION OF DATA REQUIRED FOR FITTING A POLYNOMIAL OF 
THE THIRD DEGREE 


  1       2         3        4         5       6      7       8       9        10        11         12
        Totals
  x      of y      $xy$    $x^2y$    $x^3y$    $f$   $fx$   $fx^2$  $fx^3$   $fx^4$    $fx^5$     $fx^6$
        Arrays

 -4      -17        68      -272     1,088      5    -20      80    -320     1,280    -5,120     20,480
 -3      -35       105      -315       945     13    -39     117    -351     1,053    -3,159      9,477
 -2      -23        46       -92       184     23    -46      92    -184       368      -736      1,472
 -1       -1         1        -1         1     23    -23      23     -23        23       -23         23
  0       24         0         0         0     35      0       0       0         0         0          0
  1       10        10        10        10     15     15      15      15        15        15         15
  2       19        38        76       152     16     32      64     128       256       512      1,024
  3       43       129       387     1,161     26     78     234     702     2,106     6,318     18,954
  4        8        32       128       512      6     24      96     384     1,536     6,144     24,576
  5        2        10        50       250      2     10      50     250     1,250     6,250     31,250

          30       439       -29     4,303    164     31     771     601     7,887    10,201    107,271
      $\sum y$  $\sum xy$ $\sum x^2y$ $\sum x^3y$  $n$  $\sum x$ $\sum x^2$ $\sum x^3$ $\sum x^4$ $\sum x^5$ $\sum x^6$


Note that columns 1, 2, and 6 are taken directly from the original
regression table, 1 and 6 being copied and 2 worked out in the usual
way. We can then work right across the table from left to right as
follows.

$$-4 \times -17 = 68 \qquad -4 \times 68 = -272 \qquad -4 \times -272 = 1{,}088$$
$$-4 \times 5 = -20 \qquad -4 \times -20 = 80 \qquad -4 \times 80 = -320$$

etc.


The columns are summed to give the values indicated in the last line. 
The equations to be solved can now be set up in the following manner. 


$$\begin{aligned}
164a + 31b + 771c + 601d &= 30 \\
31a + 771b + 601c + 7{,}887d &= 439 \\
771a + 601b + 7{,}887c + 10{,}201d &= -29 \\
601a + 7{,}887b + 10{,}201c + 107{,}271d &= 4{,}303
\end{aligned}$$
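
The coefficients and right-hand sides of these equations can be rebuilt from the array totals and frequencies of Table 10-5. The following Python sketch is not part of the original text. The entries for x = 0 (array total 24, frequency 35) are not legible in this copy and are an assumption, being the values required by the printed column totals of 30 and 164; the other entries are as read from the table.

```python
# (x, total of y in the array, frequency f) for each coded X array.
# The x = 0 entries (24 and 35) are inferred from the printed totals.
arrays = [
    (-4, -17, 5), (-3, -35, 13), (-2, -23, 23), (-1, -1, 23), (0, 24, 35),
    (1, 10, 15), (2, 19, 16), (3, 43, 26), (4, 8, 6), (5, 2, 2),
]
degree = 3

# Frequency-weighted sums of powers of x: n, sum x, sum x^2, ..., sum x^6.
power_sums = [sum(f * x ** k for x, _, f in arrays) for k in range(2 * degree + 1)]

# Right-hand sides: sum y, sum xy, sum x^2 y, sum x^3 y.
rhs = [sum(y * x ** k for x, y, _ in arrays) for k in range(degree + 1)]

# Row i of the normal equations holds the power sums i to i + degree.
matrix = [power_sums[i:i + degree + 1] for i in range(degree + 1)]

for row, value in zip(matrix, rhs):
    print(row, "=", value)
```

Lowering `degree` to 2 or 1 contracts the same set of sums to the quadratic or linear equations.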


Table 10-6 is then set up for solving the equations by the Doolittle 
method. The abbreviated Doolittle method can be followed if desired, 
but the full solution is more convenient here. Note that the procedure 
followed in this table is the same as that described in Chapter 8, Section 5, 
for the abbreviated Doolittle solution, but all terms are written down. 
Check by inserting the determined values of a, b, c, d in all the equations. 

The polynomial values can now be built up from the equation

$$y = 0.5728 + 0.5957x - 0.1096x^2 + 0.003533x^3$$

where $y = Y - 6$ and $x = X - 5$. Note that, for X = 1, x = 1 - 5
= -4. Therefore, substituting -4 in the above equation is equal to
substituting X = 1. For x = -4, we get y = -3.79. Then Y = 6 -
3.79 = 2.21, etc. Thus it is not necessary to put (Y - 6) for y and
(X - 5) for x and simplify before calculating polynomial values.
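
As a check on the tabular solution, the four equations can also be solved by plain elimination and back substitution. The Python sketch below is not part of the original text and is not the Doolittle layout; it is an ordinary Gaussian elimination with partial pivoting, and the names `solve` and `fitted_Y` are illustrative assumptions.

```python
def solve(matrix, rhs):
    """Solve a small linear system by elimination and back substitution."""
    n = len(rhs)
    rows = [list(map(float, row)) + [float(r)] for row, r in zip(matrix, rhs)]
    for i in range(n):
        # Partial pivoting: bring the largest remaining entry into position.
        pivot = max(range(i, n), key=lambda r: abs(rows[r][i]))
        rows[i], rows[pivot] = rows[pivot], rows[i]
        for r in range(i + 1, n):
            factor = rows[r][i] / rows[i][i]
            rows[r] = [v - factor * p for v, p in zip(rows[r], rows[i])]
    solution = [0.0] * n
    for i in reversed(range(n)):
        known = sum(rows[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (rows[i][n] - known) / rows[i][i]
    return solution

matrix = [
    [164, 31, 771, 601],
    [31, 771, 601, 7887],
    [771, 601, 7887, 10201],
    [601, 7887, 10201, 107271],
]
rhs = [30, 439, -29, 4303]
a, b, c, d = solve(matrix, rhs)

def fitted_Y(X):
    """Polynomial value in original units: code X, evaluate, then uncode y."""
    x = X - 5
    return 6 + a + b * x + c * x ** 2 + d * x ** 3

print(round(a, 4), round(b, 4), round(c, 4), round(d, 6), round(fitted_Y(1), 2))
```

Small differences from the tabulated coefficients in the last decimal places are to be expected from rounding in the hand computation.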

The most interesting part of the work is in testing the goodness of fit 
of the regression curve. Table 10-6 will provide the data for testing the 
goodness of fit not only of the third-degree curve but also of second- and 
first-degree curves. This enables us to decide the degree of fitting required. 
We first calculate the total sum of squares of Y from Table 10-4. This
gives

$$434 - \frac{(30)^2}{164} = 428.51$$

Then from Table 10-6 the sum of squares for regression due to fitting
the straight-line equation $y_1 = a_1 + b_1x$ is given by $(5, 2) \times (6, 5)^2$, where
the numbers refer to row and column, respectively. The figures are

$$765.1402 \times (0.5663397)^2 = 245.41$$


TABLE 10-6


SOLUTION OF SIMULTANEOUS EQUATIONS FOR FITTING A POLYNOMIAL OF THE THIRD DEGREE


Row          1                     2                      3                      4                     5
  1        164                    31                    771                    601                   30
  2          1.0                   0.189 024 4            4.701 220              3.664 634            0.182 926 8
  3                              771                    601                  7,887                  439
  4                               -5.859 756           -145.737 8             -113.603 7             -5.670 732
  5                              765.140 2              455.262 2            7,773.396 3            433.329 3
  6                                1.0                    0.595 004 9           10.159 44             0.566 339 7
  7                                                   7,887                 10,201                  -29
  8                                                  -3,624.640 6           -2,825.433 2           -141.036 6
  9                                                    -270.883 2           -4,625.208 9           -257.833 1
 10                                                   3,991.476 2            2,750.375 9           -427.869 7
 11                                                       1.0                    0.689 062 3         -0.107 195 8
 12                                                                        107,271                4,303
 13                                                                         -2,202.445 0           -109.939 0
 14                                                                        -78,973.353 3         -4,402.383 0
 15                                                                         -1,895.180 3            294.828 9
 16                                                                         24,200.021               85.506 9
 17                                                                              1.0                  0.003 533 339
 18                                                                    d =       0.003 533 34         0.003 533 34
 19                                            c =       -0.109 630 5           -0.002 434 69        -0.107 195 8
 20                     b =        0.595 673 6            0.065 230 7           -0.035 896 8          0.566 339 7
 21   a =    0.572 778 7          -0.112 596 8            0.515 397 1           -0.012 948 4          0.182 926 8

For fitting a quadratic of the form $y_2 = a_2 + b_2x + c_2x^2$, the sum of
squares for regression is

$$245.41 + (10, 3) \times (11, 5)^2 = 245.41 + 3991.4762 \times (0.1071958)^2 = 291.27$$

Finally, for the cubic $y_3 = a_3 + b_3x + c_3x^2 + d_3x^3$, the sum of squares is

$$245.41 + 45.86 + (16, 4) \times (17, 5)^2 = 245.41 + 45.86 + 24{,}200.021 \times (0.00353334)^2 = 291.57$$

The next step is to set up an analysis of variance for testing the increase
in the sum of squares for regression due to each degree of fitting. We
have


                              SS       DF      MS
Linear regression           245.41      1    245.41
Excess due to quadratic      45.86      1     45.86
Excess due to cubic           0.30      1      0.30
Residual                    136.94    160      0.86
Total                       428.51    163


Obviously there is significant linear regression, but the fitting of the 
quadratic makes a very worth-while improvement. The fitting of the 
cubic is of no value whatever. 
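
The conclusions above rest on comparing each added sum of squares with the residual mean square. The Python lines below are a hedged sketch, not part of the original text, that rebuild the three sums of squares from the pivot entries of Table 10-6 and form the corresponding F ratios; the names are illustrative, and small differences in the second decimal place arise from rounding.

```python
# Pivot entry and coefficient for each degree, read from Table 10-6.
steps = [
    ("linear",    765.1402,   0.5663397),    # entries (5, 2) and (6, 5)
    ("quadratic", 3991.4762, -0.1071958),    # entries (10, 3) and (11, 5)
    ("cubic",     24200.021,  0.00353334),   # entries (16, 4) and (17, 5)
]
total_ss, total_df = 428.51, 163

# Each degree of fitting adds pivot x coefficient^2 to the regression SS.
added_ss = {name: pivot * coef ** 2 for name, pivot, coef in steps}

residual_ss = total_ss - sum(added_ss.values())
residual_df = total_df - len(steps)
residual_ms = residual_ss / residual_df

# Each added sum of squares has 1 degree of freedom.
f_ratios = {name: ss / residual_ms for name, ss in added_ss.items()}

for name in added_ss:
    print(f"{name:10s} SS = {added_ss[name]:8.2f}  F = {f_ratios[name]:7.2f}")
```

The F for the cubic term falls below 1, in keeping with the statement that the third-degree term adds nothing of value.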

9. Fisher's summation method of fitting orthogonal polynomials. When
the Y values are, or can be assumed to be, of equal weight and are given 
for equal intervals of X, the method of fitting orthogonal polynomials 
developed by R. A. Fisher [2] provides a decided short cut from the 
actual to the theoretical polynomial values. The arithmetical labor is 
likewise easy as it consists largely of a process of continuous summation 
that can be done on an adding machine. A more recent development 
arising from the same mathematical background is a method of fitting 
given by Fisher and Yates [3] which is still more rapid, but a calculating 
machine and certain tables supplied by them are required. This method, 
described in detail in Example 10-3, was used in Sections 1 and 2. 

Fisher’s summation method can be described most conveniently by 
means of an example. 

10. Example 10-2. Fitting orthogonal polynomials—summation method. 
The data given in Table 10-7 are for loaf volumes in a baking experiment 
reported by Larmour [5] wherein the dough was subjected to mixing in
a high-speed mixer for varying lengths of time from 1 to 4 minutes. 


TABLE 10-7

RESULTS OF AN EXPERIMENT ON THE EFFECT OF TIME OF
MIXING OF DOUGH ON LOAF VOLUME

Mixing Time in        Loaf Volume in Cubic
Minutes (X)           Centimetres less 550 (Y)
    1.0                       235
    1.5                       280
    2.0                       255
    2.5                       190
    3.0                       120
    3.5                        55
    4.0                        20



The procedure of fitting is divided into a series of steps as indicated
below.

1. Calculation of $S_0, S_1, S_2, \ldots, S_r$, where r represents the degree of
fitting to be carried out. The data are set up as in Table 10-8, after the
actual values of X have been replaced by the natural numbers 1, 2, 3,
..., 7. The process of continuous summation is illustrated for obtaining
$S_0, S_1, S_2, S_3$, it being assumed at this stage that a third-degree curve will
be sufficient.

TABLE 10-8
CALCULATION OF $S_0, S_1, S_2, S_3$

x        Y
1       235       235       235       235
2       280       515       750       985
3       255       770     1,520     2,505
4       190       960     2,480     4,985
5       120     1,080     3,560     8,545
6        55     1,135     4,695    13,240
7        20     1,155     5,850    19,090

      1,155     5,850    19,090    49,585

      $S_0$     $S_1$     $S_2$     $S_3$


In the third column, 235 + 280 = 515, 515 + 255 = 770, etc. Similarly in the
fourth and fifth columns.
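
The process of continuous summation can be sketched in a few lines of Python. This is an illustration only and not part of the original method as printed; the function name `summation_totals` is our own.

```python
from itertools import accumulate

y = [235, 280, 255, 190, 120, 55, 20]   # loaf volumes less 550, as in Table 10-8

def summation_totals(values, degree):
    """Return [S0, S1, ..., S_degree] by repeated running summation."""
    totals = []
    column = list(values)
    for _ in range(degree + 1):
        totals.append(sum(column))          # the column total gives the next S
        column = list(accumulate(column))   # running sums form the next column
    return totals

S = summation_totals(y, 3)
print(S)
```

Each pass forms the running sums of the previous column, which is exactly what is done on the adding machine.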
2. Calculation of $a, b, c, d, e, \ldots$ and $a', b', c', d', e', \ldots$, where there
are r + 1 of these to be calculated for fitting to the rth degree.

Degree of
Fitting
   0    $a = \dfrac{1}{n} S_0$                          $a' = a$
   1    $b = \dfrac{2}{n(n+1)} S_1$                     $b' = a - b$
   2    $c = \dfrac{6}{n(n+1)(n+2)} S_2$                $c' = a - 3b + 2c$
   3    $d = \dfrac{24}{n(n+1)(n+2)(n+3)} S_3$          $d' = a - 6b + 10c - 5d$
   4    $e = \dfrac{120}{n(n+1)\cdots(n+4)} S_4$        $e' = a - 10b + 30c - 35d + 14e$
   5    $f = \dfrac{720}{n(n+1)\cdots(n+5)} S_5$        $f' = a - 15b + 70c - 140d + 126e - 42f$
   r    $\dfrac{(r+1)!}{n(n+1)\cdots(n+r)} S_r$

Coefficients in the equations on the right above are formed by successive
multiplication by

$$\frac{r(r+1)}{1 \times 2}, \quad \frac{(r-1)(r+2)}{2 \times 3}, \quad \frac{(r-2)(r+3)}{3 \times 4}, \quad \text{etc.,}$$

until the series terminates. For example, in obtaining coefficients for calculating
$d'$, r = 3. Then

$$\frac{3 \times 4}{1 \times 2} = 6, \qquad 6 \times \frac{2 \times 5}{2 \times 3} = 10, \qquad 10 \times \frac{1 \times 6}{3 \times 4} = 5.$$
Here we have

$a = \tfrac{1}{7}(1155) = 165.000\,00$          $a' = 165.000\,00$
$b = \tfrac{1}{28}(5850) = 208.928\,57$         $b' = 165.000\,00 - 208.928\,57 = -43.928\,57$
$c = \tfrac{1}{84}(19{,}090) = 227.261\,90$     $c' = 165.000\,00 - (3 \times 208.928\,57) + (2 \times 227.261\,90) = -7.261\,91$
$d = \tfrac{1}{210}(49{,}585) = 236.119\,05$    $d' = 3.452\,33$
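
As a check on step 2, the constants can be computed directly from the general term $(r+1)!\,S_r / [n(n+1)\cdots(n+r)]$ and the rule of successive multiplication. The Python sketch below is our own illustration; the names `consts` and `primed` are hypothetical and not part of the original text.

```python
from math import factorial, prod

n = 7
S = [1155, 5850, 19090, 49585]          # S0..S3 from Table 10-8

# Unprimed constants a, b, c, d: (r+1)! * S_r / (n (n+1) ... (n+r))
consts = [factorial(r + 1) * S[r] / prod(range(n, n + r + 1))
          for r in range(len(S))]

def primed(consts, r):
    """Alternating-sign combination of a, b, c, ... for degree r."""
    coef, total = 1.0, consts[0]
    for k in range(1, r + 1):
        # successive multipliers r(r+1)/(1*2), (r-1)(r+2)/(2*3), ...
        coef *= (r - k + 1) * (r + k) / (k * (k + 1))
        total += (-1) ** k * coef * consts[k]
    return total

primes = [primed(consts, r) for r in range(len(S))]
for name, value in zip(["a'", "b'", "c'", "d'"], primes):
    print(f"{name} = {value:.5f}")
```

Small differences in the last decimal place from the hand calculation are to be expected, since the sketch carries full machine precision throughout.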
3. Calculation of sums of squares for each degree of fitting and making
tests of significance. At this point, tests of significance can be made that
will determine the degree of fitting required. The sums of squares for
each degree of fitting are given by

Degree of
Fitting    Formula for Sum of Squares                                          In Example
   0       $n a'^2$ (represents fitting of mean)                               $7 \times (165.0)^2 = 190{,}575.00$
   1       $\dfrac{3n(n+1)}{(n-1)}\, b'^2$                                     $28 \times (43.928\,57)^2 = 54{,}032.14$
   2       $\dfrac{5n(n+1)(n+2)}{(n-1)(n-2)}\, c'^2$                           $84 \times (7.261\,91)^2 = 4{,}429.77$
   3       $\dfrac{7n(n+1)(n+2)(n+3)}{(n-1)(n-2)(n-3)}\, d'^2$                 $294 \times (3.452\,33)^2 = 3{,}504.06$
   4       $\dfrac{9n(n+1)\cdots(n+4)}{(n-1)(n-2)\cdots(n-4)}\, e'^2$
   5       $\dfrac{11n(n+1)\cdots(n+5)}{(n-1)(n-2)\cdots(n-5)}\, f'^2$
   r       $\dfrac{(2r+1)\,n(n+1)\cdots(n+r)}{(n-1)(n-2)\cdots(n-r)}\,(\text{constant})^2$

Since $\sum Y^2 = 252{,}575$, $\sum y^2 = 252{,}575 - 190{,}575.00 = 62{,}000.00$, and
we can set up an analysis of variance as follows.


SS DF MS 
Linear regression 54,032.14 1 54,032 
Excess due to quadratic 4,429.77 1 4,430 
Excess due to cubic 3,504.06 1 3,504 
Residual 34.03 3 11.35 
Total 62,000.00 6 


The regression sums of squares are all highly significant. The question 
as to whether or not we should proceed to further stages is answered 
reasonably well by the smallness of the residual sum of squares. It does 
not seem likely that additional fitting would reduce this appreciably. 
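
The sums of squares of step 3 follow a single general term, so they can be generated by one small function. The Python sketch below is an illustration under that reading of the formulas; the function name `ss_for_degree` is our own, and the primed constants are copied from step 2.

```python
from math import prod

n = 7
y = [235, 280, 255, 190, 120, 55, 20]
primes = [165.0, -43.92857, -7.26191, 3.45233]   # a', b', c', d' from step 2

def ss_for_degree(r, const):
    """(2r+1) n(n+1)...(n+r) / ((n-1)(n-2)...(n-r)) * const^2."""
    num = prod(range(n, n + r + 1))
    den = prod(range(n - r, n))      # empty product equals 1 when r = 0
    return (2 * r + 1) * num / den * const ** 2

ss = [ss_for_degree(r, c) for r, c in enumerate(primes)]
total = sum(v * v for v in y) - ss[0]       # sum of squares about the mean
residual = total - sum(ss[1:])
residual_ms = residual / (n - 1 - 3)

for label, value in zip(["Linear", "Quadratic", "Cubic"], ss[1:]):
    print(f"{label:10s} SS = {value:10.2f}  F = {value / residual_ms:8.1f}")
print(f"Residual   SS = {residual:10.2f}")
```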
4. Calculation of polynomial values. In order to make a graph of the
fitted curve it is necessary to determine the values of the polynomial for
each value of X. The procedure is first to obtain the quantities $Y_n, \Delta Y_n,
\Delta^2 Y_n, \Delta^3 Y_n, \ldots, \Delta^r Y_n$ given by the following general formulas for fitting
to the fifth degree, which can be expanded or contracted as desired.

General Formula for Fitting to the Fifth Degree

$$Y_n = a' + 3b' + 5c' + 7d' + 9e' + 11f'$$
$$\Delta Y_n = (-1)\,\frac{6}{(n-1)}\,(b' + 5c' + 14d' + 30e' + 55f')$$
$$\Delta^2 Y_n = (-1)^2\,\frac{60}{(n-1)(n-2)}\,(c' + 7d' + 27e' + 77f')$$
$$\Delta^3 Y_n = (-1)^3\,\frac{840}{(n-1)(n-2)(n-3)}\,(d' + 9e' + 44f')$$
$$\Delta^4 Y_n = (-1)^4\,\frac{15{,}120}{(n-1)(n-2)\cdots(n-4)}\,(e' + 11f')$$
$$\Delta^5 Y_n = (-1)^5\,\frac{332{,}640}{(n-1)(n-2)\cdots(n-5)}\,f'$$

In Example

$$Y_7 = 165.0 + (3 \times -43.928\,57) + (5 \times -7.261\,91) + (7 \times 3.452\,33) = 21.071\,05$$
$$\Delta Y_7 = (-1)\,\tfrac{6}{6}\,[(-43.928\,57) + (5 \times -7.261\,91) + (14 \times 3.452\,33)] = 31.905\,50$$
$$\Delta^2 Y_7 = \tfrac{60}{30}\,[(-7.261\,91) + (7 \times 3.452\,33)] = 33.808\,80$$
$$\Delta^3 Y_7 = -\tfrac{840}{120} \times 3.452\,33 = -24.166\,31$$


The polynomials are finally built up by a process of continuous summation
originating with the values of $Y_n, \Delta Y_n, \Delta^2 Y_n, \ldots$. The
process is illustrated with the data of the example:

x      $Y_x$       $\Delta Y$     $\Delta^2 Y$    $\Delta^3 Y$
1     236.31
2     277.02     -40.714
3     254.88      22.143     -62.8564
4     194.05      60.833     -38.6901
5     118.69      75.357     -14.5238
6      52.98      65.714       9.6425
7      21.071     31.9055     33.80880     -24.16631


The starting point of the summation is at the lower right-hand corner. 
Thus 33.808 80 — 24.166 31 = 9.6425, 9.6425 — 24.166 31 = — 14.5238, 
etc., where all decimal places are carried but only four are written down. 
In the next column, 31.9055 + 33.8088 = 65.714, 65.714 + 9.6425
= 75.357, etc., again carrying on the machine one more place than is
required.
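
The building up of the polynomial values from the lower right-hand corner can be sketched as follows. This Python fragment is our own illustration; it starts from the end value and differences of the example and at each step adds to every column the value standing to its right.

```python
# End value and differences from step 4 (third-degree fit, n = 7).
n = 7
y_n = 21.07105
diffs = [31.90550, 33.80880, -24.16631]   # first, second and third differences at x = n

state = [y_n] + diffs
fitted = [state[0]]
for _ in range(n - 1):
    # Every column takes its old value plus the old value of the column to
    # its right; the highest difference stays constant.
    state = [state[i] + state[i + 1] for i in range(len(state) - 1)] + [state[-1]]
    fitted.append(state[0])

fitted.reverse()                          # order the values from x = 1 to x = n
print([round(v, 2) for v in fitted])
```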

11. Fitting orthogonal polynomials using Fisher and Yates’ table of $\xi'$.
When available, the Fisher and Yates [3] tables provide the most rapid
method for fitting orthogonal polynomials. It is required that the values
of Y be of equal weight and at equal intervals of X.

The equation to be fitted is of the form

$$Y_x = A + B\xi_1 + C\xi_2 + D\xi_3 + \cdots \tag{14}$$

where the $\xi$'s are themselves polynomials in X (see Section 7). For
simplicity in fitting, the actual equation is

$$Y_x = A + B'\xi'_1 + C'\xi'_2 + D'\xi'_3 + \cdots \tag{15}$$

where $\xi' = \lambda\xi$ and the values of $\lambda$ are given in the tables.

A small section of the tables of $\xi'$ is reproduced here for fitting a
fifth-degree polynomial to 7 points.


x      Y        $\xi'_1$   $\xi'_2$   $\xi'_3$   $\xi'_4$   $\xi'_5$

1     $Y_1$      -3       5      -1       3      -1
2     $Y_2$      -2       0       1      -7       4
3     $Y_3$      -1      -3       1       1      -5
4     $Y_4$       0      -4       0       6       0
5     $Y_5$       1      -3      -1       1       5
6     $Y_6$       2       0      -1      -7      -4
7     $Y_7$       3       5       1       3       1

$\sum(\xi'^2)$     28      84       6     154      84
$\lambda$           1       1     1/6    7/12    7/20


To fit a straight line we require the values of $\xi'_1$ only; for the quadratic
we require $\xi'_1$ and $\xi'_2$, and so forth.

The first step is to determine the sums of squares for each degree of
fitting and, by setting up an analysis of variance, to decide on the
polynomial required. The sums of squares are given by

$$\text{First degree} \quad \frac{[\sum(Y\xi'_1)]^2}{\sum \xi'^2_1}, \qquad \text{Second degree} \quad \frac{[\sum(Y\xi'_2)]^2}{\sum \xi'^2_2}, \qquad \text{etc.} \tag{16}$$

Having decided on the degree of the polynomial that is required, we
determine

$$B' = \frac{\sum(Y\xi'_1)}{\sum \xi'^2_1}, \qquad C' = \frac{\sum(Y\xi'_2)}{\sum \xi'^2_2}, \qquad \text{etc.} \tag{17}$$

Then, by substituting the tabulated values of $\xi'_1$, $\xi'_2$, etc., in the equation,
the polynomial values can be determined. For example, in fitting a
fourth-degree curve to data for 7 points, we would have

$$Y_1 = A + B'(-3) + C'(5) + D'(-1) + E'(3)$$
$$Y_2 = A + B'(-2) + C'(0) + D'(1) + E'(-7)$$

etc.


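If necessary, equation (15) can be transformed into an equation giving
Y in terms of X. It is best to do this in two stages, transforming first to
x and then to X. This is done by means of the identities given below
for equations up to the fifth degree.

$$\xi'_1 = \lambda\xi_1 = \lambda x$$
$$\xi'_2 = \lambda\xi_2 = \lambda\left(x^2 - \frac{n^2-1}{12}\right)$$
$$\xi'_3 = \lambda\xi_3 = \lambda\left(x^3 - \frac{3n^2-7}{20}\,x\right) \tag{18}$$
$$\xi'_4 = \lambda\xi_4 = \lambda\left(x^4 - \frac{3n^2-13}{14}\,x^2 + \frac{3(n^2-1)(n^2-9)}{560}\right)$$
$$\xi'_5 = \lambda\xi_5 = \lambda\left(x^5 - \frac{5(n^2-7)}{18}\,x^3 + \frac{15n^4 - 230n^2 + 407}{1008}\,x\right)$$

where the values of $\lambda$ are different in each line and are given in the tables
of Fisher and Yates.

After finding the numerical values of the expressions containing n
in the above formulas, these are substituted in (15) and simplified. If
the equation is required in terms of X instead of x, it is necessary to
substitute $(X - \bar{X})$ for x, in which $\bar{X}$ is the actual numerical value of the
mean of X, and simplify further.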
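
As a cross-check on the method of this section, the 7-point values of $\xi'$ can be applied to the loaf-volume data of Example 10-2. The choice of these data is ours, made only because the answers are already known from the summation method; the Python names below are illustrative.

```python
y = [235, 280, 255, 190, 120, 55, 20]     # data of Example 10-2
xi = {
    1: [-3, -2, -1, 0, 1, 2, 3],
    2: [5, 0, -3, -4, -3, 0, 5],
    3: [-1, 1, 1, 0, -1, -1, 1],
}

mean = sum(y) / len(y)                    # the constant A
coef, ss = {}, {}
for degree, column in xi.items():
    sxy = sum(v * c for v, c in zip(y, column))   # sum of Y times xi'
    sxx = sum(c * c for c in column)              # sum of xi' squared
    coef[degree] = sxy / sxx                      # B', C', D'
    ss[degree] = sxy ** 2 / sxx                   # sum of squares for the degree

fitted = [mean + sum(coef[d] * xi[d][i] for d in xi) for i in range(len(y))]
print({d: round(v, 2) for d, v in ss.items()})
print([round(v, 2) for v in fitted])
```

The sums of squares agree with those of the summation method to within the rounding carried in the hand calculation (the cubic term comes out at about 3,504.17 here against 3,504.06 there).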
12. Example 10-3. Fitting polynomials with Fisher and Yates’ table
of $\xi'$. This process of fitting polynomials by successive stages using
Fisher and Yates’ table of $\xi'$ is illustrated with the data of Table 10-9,
for the relation between pH and activity of the enzyme asparaginase,
Geddes and Hunter [4]. The table also gives the values of $\xi'$ which have
been taken from Fisher and Yates’ table.


TABLE 10-9

DATA FOR pH (X) AND ACTIVITY OF THE ENZYME ASPARAGINASE (Y), WITH VALUES
OF $\xi'$ REQUIRED FOR FITTING A FIFTH-DEGREE ORTHOGONAL POLYNOMIAL

                                                                      Y for
                                                                      Fourth-Degree
  x      Y       $\xi'_1$    $\xi'_2$     $\xi'_3$      $\xi'_4$      $\xi'_5$      Equation
  1     0.2     -13      13      -143       143      -143        0.27
  2     0.4     -11       7       -11       -77       187        0.17
  3     1.4      -9       2        66      -132       132        1.69
  4     4.1      -7      -2        98       -92       -28        4.02
  5     6.6      -5      -5        95       -13      -139        6.48
  6     8.7      -3      -7        67        63      -145        8.55
  7     9.8      -1      -8        24       108       -60        9.89
  8     9.9       1      -8       -24       108        60       10.26
  9     9.5       3      -7       -67        63       145        9.62
 10     8.2       5      -5       -95       -13       139        8.05
 11     6.4       7      -2       -98       -92        28        5.81
 12     3.3       9       2       -66      -132      -132        3.29
 13     0.3      11       7        11       -77      -187        1.04
 14     0.1      13      13       143       143       143       -0.24

Sum    105     68.9     $\sum(\xi'^2)$   910     728    97,240   136,136   235,144      68.90
Mean   7.5     4.9214   $\lambda$          2     1/2       5/3      7/12      7/30


The calculations for A, B', C', D', E', and F' are given below, together
with the sums of squares representing each additional degree of fitting.

$\sum y^2 = 541.7100 - 339.0864 = 202.6236$

$\sum(\xi'_1 Y) = 41.3$       $B' = 0.045\,385$        $SS = \dfrac{(41.3)^2}{910} = 1.8744$
$\sum(\xi'_2 Y) = -361.8$     $C' = -0.496\,98$        $SS = \dfrac{(361.8)^2}{728} = 179.8066$
$\sum(\xi'_3 Y) = -574.2$     $D' = -0.005\,905\,0$    $SS = \dfrac{(574.2)^2}{97{,}240} = 3.3906$
$\sum(\xi'_4 Y) = 1484.4$     $E' = 0.010\,904$        $SS = \dfrac{(1484.4)^2}{136{,}136} = 16.1856$
$\sum(\xi'_5 Y) = 162.4$      $F' = 0.000\,690\,64$    $SS = \dfrac{(162.4)^2}{235{,}144} = 0.1122$
                                                   Residual = 1.2542
The analysis of variance takes the form given in Table 10-10. 
TABLE 10-10
ANALYSIS OF VARIANCE FOR INDIVIDUAL DEGREES OF FITTING

                     SS         DF       MS          F       5% Point
First degree        1.8744      1       1.8744      12.0      5.32
Second degree     179.8066      1     179.8066    1147        5.32
Third degree        3.3906      1       3.3906      21.6      5.32
Fourth degree      16.1856      1      16.1856     103        5.32
Fifth degree        0.1122      1       0.1122
Residual            1.2542      8       0.1568
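
The whole of the arithmetic leading to Table 10-10 can be reproduced from the data and the tabulated $\xi'$ values of Table 10-9. The Python sketch below is our own illustration of equations (16) and (17); it is not the procedure as carried out on a calculating machine, and the names are hypothetical.

```python
y = [0.2, 0.4, 1.4, 4.1, 6.6, 8.7, 9.8, 9.9, 9.5, 8.2, 6.4, 3.3, 0.3, 0.1]
xi = [  # xi'_1 to xi'_5 for 14 points, as in Table 10-9
    [-13, -11, -9, -7, -5, -3, -1, 1, 3, 5, 7, 9, 11, 13],
    [13, 7, 2, -2, -5, -7, -8, -8, -7, -5, -2, 2, 7, 13],
    [-143, -11, 66, 98, 95, 67, 24, -24, -67, -95, -98, -66, 11, 143],
    [143, -77, -132, -92, -13, 63, 108, 108, 63, -13, -92, -132, -77, 143],
    [-143, 187, 132, -28, -139, -145, -60, 60, 145, 139, 28, -132, -187, 143],
]

n = len(y)
mean = sum(y) / n
total_ss = sum(v * v for v in y) - sum(y) ** 2 / n

coef, ss = [], []
for column in xi:
    sxy = sum(v * c for v, c in zip(y, column))
    sxx = sum(c * c for c in column)
    coef.append(sxy / sxx)          # B', C', D', E', F'
    ss.append(sxy ** 2 / sxx)       # sum of squares for each degree

residual = total_ss - sum(ss)
residual_ms = residual / (n - 1 - len(xi))
f_ratios = [s / residual_ms for s in ss[:4]]

# Fourth-degree polynomial values from equation (19)
fitted = [mean + sum(coef[d] * xi[d][i] for d in range(4)) for i in range(n)]
print([round(s, 4) for s in ss], round(residual, 4))
print([round(f, 1) for f in f_ratios])
```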


It is clear from this analysis that a polynomial of the fourth degree
should be fitted. The simplest possible equation that will give the
polynomial values is

$$Y = A + B'\xi'_1 + C'\xi'_2 + D'\xi'_3 + E'\xi'_4 \tag{19}$$

Therefore

$$Y_1 = 4.921 + 0.045\,38(-13) - 0.4970(13) - 0.005\,905(-143) + 0.010\,90(143) = 0.273$$

and, by substituting further values of $\xi'_1$, $\xi'_2$, $\xi'_3$, and $\xi'_4$, the complete
set of polynomial values is obtained as given in the last column.

The equation of the polynomial can be expressed in terms of powers of
X, but a little more labor is required. In the first place $\xi'_1$, $\xi'_2$, $\xi'_3$, and $\xi'_4$
must be expressed in powers of x, where x, as usual, represents $(X - \bar{X})$.
The equations for conversion are given in (18). In using these it must
be noted that $\xi'_1 = \lambda_1\xi_1$, and similarly for the other values of $\xi'$.

$$\xi'_1 = 2\xi_1 = 2x$$
$$\xi'_2 = \tfrac{1}{2}\xi_2 = \tfrac{1}{2}(x^2 - 16.25) = \tfrac{1}{2}x^2 - 8.125$$
$$\xi'_3 = \tfrac{5}{3}\xi_3 = \tfrac{5}{3}(x^3 - 29.05x) = \tfrac{5}{3}x^3 - 48.4167x$$
$$\xi'_4 = \tfrac{7}{12}\xi_4 = \tfrac{7}{12}(x^4 - 41.0714x^2 + 195.3482) = \tfrac{7}{12}x^4 - 23.9583x^2 + 113.9531$$
Substituting in (19) gives

$$Y = 4.9214 + 0.045\,38(2x) - 0.4970\left(\tfrac{1}{2}x^2 - 8.125\right) - 0.005\,905\left(\tfrac{5}{3}x^3 - 48.4167x\right) + 0.010\,90\left(\tfrac{7}{12}x^4 - 23.9583x^2 + 113.9531\right)$$
$$= 10.2016 + 0.3767x - 0.5096x^2 - 0.009\,842x^3 + 0.006\,358x^4$$

This equation gives estimates of the values of Y by substituting deviations
from the mean of X. Thus

$$Y_1 = 10.2016 + 0.3767(-6.5) - 0.5096(42.25) - 0.009\,842(-274.625) + 0.006\,358(1785.0625) = 0.27$$

If necessary we can substitute (X - 7.5) in place of x and simplify,
arriving at an equation in which X can be substituted directly.
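
The collection of terms in powers of x is a purely mechanical step and can be verified as follows. In this Python sketch, which is our own illustration, each $\xi'$ is written as a list of coefficients of $x^0$ to $x^4$ for n = 14, and the rounded constants are those used in the text.

```python
# Constants of the fourth-degree fit, rounded as in the text.
A, B, C, D, E = 4.9214, 0.04538, -0.4970, -0.005905, 0.01090

# Each xi' for n = 14 as coefficients of [x^0, x^1, x^2, x^3, x^4].
xi1 = [0, 2, 0, 0, 0]
xi2 = [-8.125, 0, 0.5, 0, 0]
xi3 = [0, -48.4167, 0, 5 / 3, 0]
xi4 = [113.9531, 0, -23.9583, 0, 7 / 12]

# Collect like powers of x; the constant A enters the x^0 term only.
powers = [B * p1 + C * p2 + D * p3 + E * p4
          for p1, p2, p3, p4 in zip(xi1, xi2, xi3, xi4)]
powers[0] += A

def evaluate(x):
    """Value of the polynomial at deviation x from the mean of X."""
    return sum(c * x ** k for k, c in enumerate(powers))

print([round(c, 6) for c in powers])
print(round(evaluate(-6.5), 2))     # first point, X = 1, x = 1 - 7.5
```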


13. Exercises. 
TABLE 10-11

DIFFERENCE IN RATE OF GROWTH OF LARVAE OF Choristoneura
fumiferanae (Clem), THE SPRUCE BUD WORM, FED ON 2 TYPES
OF FOOD*

Instar         2     3     4     5     6     7
Difference     6    20    63    75    64    39

TABLE 10-12 


AGE OF ONSET OF MUSCULAR DYSTROPHY IN PAIRS OF 
First Cousins (Bell) 


10 68 28 13 Y 4 4 134 


60-69 2 1 1 4 
50-59 1 1 1 1 4 

40-49 3293; 1 7 

Vx 30-39 DET 3 1 2 13 
20-29 Sib)» 20 3 1 28 

10-19 | 2 56 9 1 68 

БОШ КЫЕК? 10 


0-9 10-19 20-29 30-39 40-49 50-59 60-69 
Age of X 


Mean Squares: Quadratic = 13.653 
Cubic = 0.203 
Residual = 0.930 


* Data courtesy of R. Lejeune, Forest Insects Laboratory, Winnipeg, Manitoba. 
1. Fit an orthogonal polynomial by the method of Example 10-3 for the data of
Table 10-11, determining the best fitting curve by testing at each stage of fitting. 
Calculate the polynomial values and graph with actual values on the same chart. 

2. Table 10-12 was obtained in a study by Julia Bell [1] on hereditary muscular 
dystrophy. This is a symmetrical table, familiar in heredity studies, owing to 2 entries 
being made for each pair. Take Y and X as shown and calculate and graph the means 
of the Y arrays. Carry the fitting of a third-degree polynomial, as in Example 10-1,
to the stage of testing the linear, quadratic, and cubic components of regression. 
CHAPTER 11


Basic Experimental Designs 


1. Principles of experimental design. The logical procedure of
drawing conclusions from experiments is analogous to the everyday 
procedure of learning by experience. The main difference is that in 
experimental work our experience is planned in advance, whereas in 
merely learning by experience we are exposed to events in a more or less 
haphazard fashion. In theory, therefore, we should be able to learn 
rapidly and efficiently by performing experiments, but, although there is 
a distinct advantage from being able to plan in advance, there is also a 
danger in that our planning may be faulty, and the experience from which 
we had anticipated learning turns out to be a waste of time and energy.
There is a great need, therefore, for careful thought with respect to experi- 
mental design. There is need for searching investigations into principles 
of design in order that these may stand out clearly and supply a framework 
on which sound designs may be built. 

Although certain principles of experimentation were developed gradu- 
ally along with increasing research work in all lines of endeavor, it is now 
generally accepted that the fundamental principles of experimental design 
were for the first time brought clearly into focus with the writings of 
Professor R. A. Fisher and especially with the first edition of Statistical 
Methods for Research Workers in 1925, and The Design of Experiments in 
1935. Previous to 1925, Professor Fisher had developed the technique 
of the analysis of variance and discovered the distribution that furnished 
appropriate tests of significance in conjunction with the analysis pro- 
cedure. It is possible, however, that the greatest contribution made by 
Professor Fisher was his attitude toward the logic of experimentation.
Research workers generally draw their conclusions by inductive inference. 
They reason from the particular to the general. This they are forced to 
do because there is no alternative, but mathematical methods are usually 
deductive, and there existed a great gulf between the experimentalist and 
the mathematician, which very few had been able to bridge. Great 
contributions such as those made by Professor Karl Pearson were largely 
ineffective in experimental work because of a general belief that they were 
useless unless we could deal with large samples. A start was made by
“Student” in 1908 with the development of the distribution that evolved
finally into the distribution of t. Then with the appearance of Professor
Fisher’s Statistical Methods for Research Workers the experimentalist 
found for the first time a type of statistical reasoning that was essentially 
similar to his own. There was complete recognition of inductive logic 
and at the same time a presentation of statistical tools that could be 
applied without breaking faith with mathematical principles of proba- 
bility. Only history will give the complete picture, but a personal opinion 
can be ventured that this point of view, presented by Professor Fisher and 
since developed by many others, was the primary cause of a change of 
attitude on the part of experimentalists towards statistical methods, and 
the rapid increase in the application of statistical methods in practically 
all research laboratories. We shall find it profitable, therefore, to
summarize a few of the basic principles of experimental design at this 
point. Others will emerge as we proceed, but the most essential can be 
stated now. 


1. The experiment must be free from bias. An experiment must be planned so that
it gives unbiased estimates of the values we wish to measure. This is not simply a 
matter of the experimenter claiming that there is no bias in his experiments. It is a 
matter of the design being such that no bias on the part of the experimenter can possibly 
enter into the results. This is accomplished mainly by randomization. We do not 
say, merely, “These two greenhouse benches are as nearly alike as possible with respect
to conditions, so I will put treatment A on the plants on this bench and treatment B
on the other one.” As a routine procedure we allow a table of random numbers to 
decide the issue. In administering two treatments to a series of pairs of animals we 
do not select the animals to receive the alternate treatments “without bias.” We make 
sure that there is no bias by referring again to a table of random numbers. 

It should not be inferred that the absence of a conscious bias on the part of the 
experimenter is sufficient. In classifying wheat plants at the Dominion Laboratory 
of Cereal Breeding into two classes, resistant and susceptible, there was no anxiety 
on the part of the plant breeders to obtain any specific ratios, but it was found that if 
one pile of plants was quite small there was a tendency to place borderline plants in 
this pile rather than in a larger one. For some illuminating accounts of unconscious 
personal bias the student is referred to an excellent article by Yates [14] cited at the 
end of this chapter. 

2. There must be a measure of error. This principle is related in part to the necessity 
for freedom from bias. The treatments being compared may appear to be producing 
decidedly different results. But this is merely a matter of opinion on the part of the 
investigator. It is a purely subjective conclusion. The true experiment is one that is 
strictly objective. It should itself furnish a measure of error and this error alone should 
be the measuring stick of significance. To argue against this is merely to argue for a
type of experiment that is not really a scientific experiment because its significance
depends on the judgment of some individual.

Are we to discard all experimental results where there is no measure of error? This 
would seem to be the alternative in view of what has been said above. Actually, the 
technique does not need to be quite so drastic, because when we look more closely at 
almost any experiment we find usually that there is some measure of error, and if this
error is chosen objectively the experiment may be salvaged. A series of unreplicated
plot treatments can be carried on over a period of years, and the years will then
constitute replications and an error can be obtained. Even in a single series of unreplicated
plots certain comparisons can be made by making an orthogonal division of sums of 
squares and degrees of freedom. Finally, there are cases where the results are definitely 
significant and there is no measure of error, as in a pair of plots to which we apply 
a method of weed control and get a 100% kill in the treated plot and no kill whatsoever 
in the control plot. Basically, however, our thinking in such instances is dependent 
on the existence of an experimental error that we can gauge by experience. Our 
knowledge of plots and weeds tells us that such a result cannot have happened by 
chance. If this is true, it is admittedly desirable to have a real measure of error furnished
by the experiment itself. 

We should also keep in mind the possibility of errors of the second kind (see Chapter 
4, Section 1). If we have no measure of error and will accept only those results that 
are obviously significant, we must be prepared to allow a great many real effects to 
go unnoticed. In these days of accurate and well-controlled experiments, the latter 
may be a greater "sin" than pronouncing an effect to be real when it actually is not.

3. There must be a clearly defined objective. From experience in giving advice to 
experimenters and discussing their plans with them, one soon learns that experiments 
are very frequently designed without a perfectly clear objective. We frequently hear 
“I want to compare this series of treatments." This is not enough. We must have 
a good reason for selecting these treatments. Do they represent the combinations and 
levels that will give the required information? To which areas are the results to apply, 
and under what conditions? Are we seeking merely empirical results or do we want 
to know the reason why? What measures are to form the basis of comparison? On 
looking over these questions it must be realized that unless we know the answers we 
cannot set up a null hypothesis, and if we cannot set up a hypothesis we cannot make 
a test of significance. If there is no hypothesis, there is no experiment. 

The logic of the methods of statistics is an essential in designing an experiment 
correctly, because it teaches us the importance of clear thinking with respect to the 
hypothesis that we wish to test. Possibly, we can regard the hypothesis as the core 
of the experiment or the foundation on which it is built. Without this foundation we do 
not know how to plan the experiment, nor do we know how to make tests of significance. 

4. The experiment should have sufficient accuracy to accomplish its purpose. Increased 
accuracy in experimentation is brought about, first, by the elimination of technical 
errors, and second, by increased replication. Generally speaking we think chiefly in 
terms of replication as our principal method for improving accuracy, and this is true
with respect to uncontrolled errors or to errors that are inherent in and characteristic 
of the experimental material, but it should not be assumed that replication will overcome 
errors in technique. Although replication will tend to overcome such errors, it is a
very uneconomical way to accomplish the desired end.

Assuming that technical errors are reduced to a minimum, we have to decide on the 
number of replications required to produce a given degree of accuracy. The actual 
method will be elaborated later, but we can emphasize at this point that it is poor 
economy to cut down on replications to such an extent that the results of the experiment 
may be lost. Usually, there is previous experience that can be applied in order to 
arrive at a rough measure of the number of replications required, but in any new field
of research where the extent of the errors is uncertain some preliminary investigation
may be necessary.

5. The experiment should have sufficient scope. To illustrate the meaning of scope
in the design of an experiment we can consider the two extremes. Suppose that an 
experiment is planned in which we wish to estimate the effect of milk as compared to 
no milk in the diet of young pigs. We ask ourselves immediately: To what actual 
ages do we refer when we say “young pigs"? In other words, we are asking what is 
to be the scope of the experiment. If it is limited in scope, it will apply to one particular 
stage of growth. If it is broad in scope, it will apply over the whole range of growth 
during which milk is likely to be fed. There is, of course, no objection to confining 
such an experiment to one stage of growth only. This experiment would have adequate 
scope for deciding on the value of milk in the particular stage studied, but it would be 
decidedly limited in scope if we wished to know the effect of milk throughout the whole 
period. Here again we have the principle of a clear objective. Knowing the objective, 
we can decide on the scope required. 

The value of having adequate scope in an experiment was at one time thought to be 
taken care of by repeating the experiment several times, in each experiment holding all 
but one of the factors constant. This procedure is now outmoded and has been 
replaced largely by what is known as factorial experimentation. In a factorial experi- 
ment, we vary all the factors simultaneously that are required to be varied in order 
to obtain sufficient scope. For example, suppose that two kinds of fertilizers are to 
be tested at different levels. We do not set up one experiment to compare them at the 
first level, another at the second level, and so forth. We make up all the required 
combinations of fertilizer and levels and test them all in one experiment. This adds 
further to the scope of the experiment by enabling us to test the interaction effects of 
the fertilizer and levels. 
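
The making up of all the required combinations in a factorial experiment can be illustrated very simply. In the Python sketch below the fertilizer names and the levels are hypothetical, chosen only for illustration, and the order of the combinations is left to a random number generator in keeping with the first principle stated above.

```python
import itertools
import random

fertilizers = ["P", "Q"]      # hypothetical names for the two kinds of fertilizer
levels = [1, 2, 3]            # hypothetical levels of application

# Every fertilizer is combined with every level in the one experiment.
treatments = list(itertools.product(fertilizers, levels))

rng = random.Random(1)        # fixed seed so that the example can be repeated
layout = treatments[:]
rng.shuffle(layout)           # random order of the combinations on the plots

print(treatments)
print(layout)
```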


2. Ungrouped randomized experiments. This is the simplest of the basic 
designs. Suppose that 2 treatments A and B are to be tested on a series
of 100 field plots. The ungrouped experiment would involve selecting 50 
plots at random to receive A and 50 to receive B. The analysis of variance 
of the yields* would be as follows: 


DF 

Treatments 1 
Error 98 
Total 99 


If the yields of treatment A are $X_a$ and those of B are $X_b$, the sums of
squares are calculated from

$$\text{Error (SS)} = \sum X_a^2 - \frac{(\sum X_a)^2}{50} + \sum X_b^2 - \frac{(\sum X_b)^2}{50} \tag{1}$$

$$\text{Treatments (SS)} = \frac{(\sum X_a - \sum X_b)^2}{100} \tag{2}$$



* Yield is used here and throughout the remainder of this book in a general sense. 
It may represent real yields in pounds, bushels, or tons per acre, or it may be strength
of straw, the gain in weight of an animal, or some other characteristic that has been
measured and is appropriate for comparing the treatments. The term treatment may 
also be interpreted in a broad sense. 
We are not so much concerned here with how to carry out such an
experiment as we are in assessing its value as a design. It has certain 
advantages, such as simplicity and ease of calculation, and if one or more 
plots are missing there is no loss of orthogonality. Its chief disadvantage 
is that there is no attempt in the structure of the design to reduce the 
error to a minimum by the exercise of error control. The simplest design 
in which there is exercise of error control is the next one to be discussed. 
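
The ungrouped design and its two sums of squares can be sketched as follows. The yields in this Python fragment are artificial, generated only for illustration, and the names are our own; the allotment of plots to treatments is made at random, and the error and treatment sums of squares are computed as for 50 plots of A and 50 of B.

```python
import random

rng = random.Random(0)                    # fixed seed so the example can be repeated

# Select 50 of the 100 plots at random for A; the remainder receive B.
plots = list(range(100))
rng.shuffle(plots)
a_plots, b_plots = plots[:50], plots[50:]

# Artificial yields, for illustration only.
yields = [rng.gauss(30, 3) for _ in range(100)]
xa = [yields[i] for i in a_plots]
xb = [yields[i] for i in b_plots]

error_ss = (sum(v * v for v in xa) - sum(xa) ** 2 / 50
            + sum(v * v for v in xb) - sum(xb) ** 2 / 50)      # 98 DF
treat_ss = (sum(xa) - sum(xb)) ** 2 / 100                      # 1 DF
total_ss = sum(v * v for v in yields) - sum(yields) ** 2 / 100 # 99 DF

print(round(treat_ss, 3), round(error_ss, 3), round(total_ss, 3))
```

The treatment and error sums of squares add to the total, as the degrees of freedom 1 + 98 = 99 require.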

3. The randomized block experiment. This name was first applied to 
field experiments by R. A. Fisher. It had been a common practice 
among field plot investigators to have 2 or more replicates, each replicate 
containing a complete set of the treatments, but the usual practice was 
to arrange the treatments in systematic order in the replicate. Fisher [8] 
pointed out that the systematic arrangement was likely to cause a bias 
and introduced the principle of arranging the treatments within each 
replicate strictly at random. In terms of the analysis of variance and the 
making of tests of significance, the systematic arrangement annuls the 
assumption that the errors be uncorrelated. Plots that are adjacent tend 
to yield alike, and hence they will have similar deviations from the
replicate mean, the treatment mean, and the general mean. Randomiza- 
tion does not render the yields of adjacent or nearby plots uncorrelated, 
but it distributes the correlation effect over the whole set of plots in such 
a way that the F distribution is realized. 

Assuming that there are c treatments and r replicates in a randomized
block experiment, each replicate contains c plots, one of each treatment.
For 6 treatments A, B, C, D, E, F in 3 replicates the design might be as
follows, where the treatments have been arranged entirely at random.

Replicate 1     D    B    A    F    C    E
Replicate 2     E    C    B    F    D    A
Replicate 3     C    F    B    D    A    E

This is the design as we set it down on paper. In the field the treatments
would occupy plots in positions corresponding to those in the diagram.
The data from the experiment would be set up in an r x c table, and the
method of analysis would follow that of Example 5-2.
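
A layout of this kind is obtained by making a separate randomization within each replicate. The Python sketch below is an illustration only; a pseudo-random generator with a fixed seed takes the place of the table of random numbers.

```python
import random

rng = random.Random(3)            # fixed seed so that the layout can be reproduced
treatments = list("ABCDEF")
replicates = 3

layout = []
for _ in range(replicates):
    order = treatments[:]         # each replicate holds one complete set
    rng.shuffle(order)            # independent randomization within the replicate
    layout.append(order)

for number, order in enumerate(layout, start=1):
    print(f"Replicate {number}   " + "  ".join(order))
```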

The outline of the analysis for a randomized block experiment will be

                    DF                  MS
Replicates          r - 1
Treatments          c - 1               $E_t$
Error               (r - 1)(c - 1)      $E_e$

Total               rc - 1

The error mean square will provide the standard errors.

$$\text{SE (single plot)} = \sqrt{E_e} \tag{3}$$

$$\text{SE (mean of 1 treatment)} = \sqrt{\frac{E_e}{r}} \tag{4}$$

$$\text{SE (difference of 2 treatment means)} = \sqrt{\frac{2E_e}{r}} \tag{5}$$

and a significant difference between 2 treatment means selected at random
will be*

$$t_{.05}\sqrt{\frac{2E_e}{r}} \tag{6}$$



* After having conducted an experiment and found a significant F, it is a common
practice to select 2 means and make the t test given in (6) above. Actually this procedure
is correct only if the 2 means are selected at random, and then of course it has
very little practical value. The difficulty arises from the fact that, if the means are 
arranged in order from the lowest to the highest and we select, say, the 2 extreme values, 
a significant t will nearly always be obtained even if F indicates that no significant 
differences are present. Snedecor [10, page 409] points out that such a test is not 
really a test of a random difference but the range of the random means included. It 
is possible to make a t test of the significance of a range based on the distribution of
the range divided by its standard deviation, but on making a series of such comparisons
there are complications arising from their not being independent. This whole problem
has been considered by Tukey [11], and a method given for dividing up the ranked 
means into groups. D. B. Duncan (1. A Significance Test for Differences between 
Ranked Treatments in an Analysis of Variance, 1951, to be published. 2. On the
Properties of the Multiple Comparisons Test, 1951, to be published.) has also presented
a logical procedure. The student of this problem is advised to examine the methods 
presented by Tukey and Duncan. 

For a fairly large number of means (say 25) and with an absence of noticeable gaps 
we can take the general mean $\bar{X}$ and calculate $\bar{X} + 2.8s_m$, where $s_m$ is the standard
deviation of a mean determined from the error mean square of the analysis. All those
means exceeding $\bar{X} + 2.8s_m$ will constitute a small group that are significantly higher
than the remainder. In plant breeding work where the test forms the basis for the 
selection of high yielding types, this procedure is reasonably satisfactory. 

As a general rule in the interpretation of differences among means it is impossible 
and quite impractical to discard entirely the information that can be obtained from a
t test. It is necessary, however, to observe three points quite closely.

1. Do not apply the t test to any differences between means unless the F test indicates
that real differences are present. This does not mean that an F exceeding the 5%
point must necessarily be obtained, but it must certainly exceed the 10% point.

2. Do not apply the t test simply because a difference looks big and you think it
should be significant. There must be some logical reason for wishing to make the 
comparison, quite independent of the results obtained. 

3. It is valid to make any г test provided that the test is preconceived at the time the 
experiment is designed. 
The chief advantage of the randomized block experiment over the 
ungrouped design as outlined in Section 2 is that it makes an attempt to 
control the error by identifying a portion of the total variation with the 
replicate means. Any variation due to the replicates which, by virtue of 
the design, is orthogonal with the treatments is therefore independent of 
the treatment variation. Assuming that in a certain experimental area 
there is given total variation, any portion that is represented by the 
replicates is therefore virtually removed from the experiment with a 
consequent increase in precision. 

The effect of the blocks in increasing the precision of an experiment 
can be demonstrated very neatly by breaking down each mean square 
into the variance components being estimated. For the ungrouped and 
randomized block experiments, the variance components are as follows. 


Ungrouped Experiment 
                  MS          Variance Components Estimated          DF 
Error             $E_1$       $\sigma_1^2$                           $c(r - 1)$ 
Treatments        $E_{t1}$    $\sigma_1^2 + r\sigma_t^2$             $c - 1$ 

Randomized Blocks 
                  MS          Variance Components Estimated          DF 
Error             $E_2$       $\sigma_2^2$                           $(r - 1)(c - 1)$ 
Treatments        $E_{t2}$    $\sigma_2^2 + r\sigma_t^2$             $c - 1$ 
Replicates        $E_r$       $\sigma_2^2 + c\sigma_r^2$             $r - 1$ 


Assuming that the total sum of squares is the same for the two experi- 
ments, we have estimates of the sums of squares in each line by multiplying 
the variance components by the degrees of freedom, giving, with some 
simplification, 

$$(rc - 1)\sigma_1^2 = (rc - 1)\sigma_2^2 + c(r - 1)\sigma_r^2$$ 

or 

$$\sigma_1^2 = \sigma_2^2 + \frac{c(r - 1)}{rc - 1}\,\sigma_r^2 \qquad (7)$$ 

showing that $\sigma_2^2$ must be less than $\sigma_1^2$ in the population sampled unless 
$\sigma_r^2 = 0$. Since $E_1$ is an estimate of $\sigma_1^2$, and $E_2$ is an estimate of $\sigma_2^2$, it 
follows that, on the average, $E_2 < E_1$. Similarly, for the F values we have 


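Ungrouped experiment: 

$$F \sim \frac{\sigma_1^2 + r\sigma_t^2}{\sigma_1^2} = 1 + \frac{r\sigma_t^2}{\sigma_1^2}$$ 

Randomized blocks: 

$$F \sim \frac{\sigma_2^2 + r\sigma_t^2}{\sigma_2^2} = 1 + \frac{r\sigma_t^2}{\sigma_2^2}$$ 

* The symbol ~ is to be read "is an estimate of." 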
and it follows that, on the average, F for the randomized block experiment 
will be larger than F for the ungrouped experiment, unless $\sigma_r^2 = 0$. 
Equation (7) can be used after a randomized block experiment has been 
conducted to obtain an estimate of what the error would have been for 
an ungrouped experiment. For the randomized block design, $E_2$ is an 
estimate of $\sigma_2^2$, and $E_r$ is an estimate of $\sigma_2^2 + c\sigma_r^2$, or, in a different form, 
$(E_r - E_2)/c$ is an estimate of $\sigma_r^2$. Substituting in (7) gives, as the estimate of $\sigma_1^2$, 

$$\frac{(r - 1)E_r + r(c - 1)E_2}{rc - 1} \qquad (8)$$ 
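The substitution leading to (8) can be written out step by step. The lines below are a sketch that assumes the symbols used above: $E_r$ and $E_2$ are the replicate and error mean squares of the randomized block analysis, with r replicates and c treatments; $\hat{\sigma}_1^2$ is notation introduced here for the estimate of $\sigma_1^2$.

```latex
\begin{aligned}
\hat{\sigma}_1^2
  &= E_2 + \frac{c(r-1)}{rc-1}\cdot\frac{E_r - E_2}{c} \\
  &= \frac{(rc-1)E_2 + (r-1)(E_r - E_2)}{rc-1} \\
  &= \frac{(r-1)E_r + r(c-1)E_2}{rc-1}.
\end{aligned}
```

The coefficient of $E_2$ is $r(c-1)$, which equals the sum of the treatment and error degrees of freedom, $(c-1) + (r-1)(c-1)$.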


The success of the randomized block experiment depends to a consider- 
able extent on the skill of the operator in setting up the design so that the 
replicates will coincide with some major source of variability. In a field 
plot experiment this can sometimes be accomplished by making the 
replicates agree with topographical features of the land or known fertility 
trends. In general, it will be found desirable to make the blocks as 
compact as possible, especially when there is no previous knowledge of 
the nature of fertility trends. The ideal randomized block field experi- 
ment on the average is one in which the replicates are square and the plots 
are strips running the full length of one side of the square. In other types 
of experiments the replicates can be identified usually with sources of 
variability corresponding with time, position, classification of the experi- 
mental units, and so forth. In a feeding experiment the animals may be 
divided into replicates according to weight, age, or some other factor. 
In laboratory determinations time trends are frequently important, and 
it is then customary to make one determination for each combination 
being tested during a given period of time. Each period then becomes a 
replication, and major time trend effects are removed. It is impossible 
to list the various ways in which treatments can be grouped in order to 
obtain error control. The experience of the experimenter must be relied 
upon to furnish the necessary information. In any event it is very 
unusual to find a condition for which grouping of the treatments will not 
bring about some measure of error control. It should be borne in mind 
when planning an experiment that the advantages of correct 
design, with particular reference to the exercise of error control, are gained 
at an additional cost in labor and money that is practically negligible. The gain 
in precision is a function of the design itself and is not necessarily related 
to the expenditure of effort. Frequently, the best design is the easiest 
design to carry out and there is an actual economy involved over and 
above obtaining a higher degree of precision. 

The randomized block experiment has certain advantages over more 
complex designs. In the first place, it can be adapted to a varied number 
of treatments and replications without any restrictions. In the second 
place, it is always possible to separate the sum of squares for error into 
components corresponding to particular treatment effects. This feature 
is of distinct advantage when the error is not homogeneous and it is not 
strictly correct to use a pooled error for a comparison of all treatment 
combinations. This problem is dealt with in detail by Cochran [3]. 
The example given by Cochran, and originally by Bartlett [1], is reproduced 
in Example 11-1, except that one of the treatments has been omitted in 
order to illustrate the change that this requires in the calculations. 

The effect of grouping the treatments into replicates can be indicated 
very neatly by formula (8) which is repeated below in the form given by 
Cochran and Cox [4]. The formula is 

$$E_{c.r.} = \frac{n_b E_b + (n_t + n_e)E_e}{n_b + n_t + n_e} \qquad (9)$$ 

where $n_b$, $n_t$, and $n_e$ are the degrees of freedom for blocks, treatments, 
and error, and $E_b$ and $E_e$ are the mean squares for blocks and error. 
$E_{c.r.}$ is an estimate of what the error mean square would have been in an 
ungrouped experiment using the same set of experimental units. 
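As a quick numerical check, the formula can be wrapped in a small function. This is a minimal sketch: the function and argument names are illustrative only, and the mean squares are those of the wheat experiment in Example 11-2 of this chapter (3 DF for blocks, 5 for varieties, 15 for error).

```python
def ungrouped_error_ms(n_b, n_t, n_e, ms_blocks, ms_error):
    """Estimated error mean square had the same units been left ungrouped."""
    return (n_b * ms_blocks + (n_t + n_e) * ms_error) / (n_b + n_t + n_e)


# Mean squares from Example 11-2: blocks 55.79 (3 DF), error 9.655 (15 DF).
e_cr = ungrouped_error_ms(n_b=3, n_t=5, n_e=15, ms_blocks=55.79, ms_error=9.655)
print(round(e_cr, 2))  # 15.67
```

The value agrees with the 15.67 obtained by hand in Example 11-2.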

One characteristic of the randomized block experiment, especially with 
reference to field plot experiments, is that with increasing numbers of 
treatments per block the efficiency of error control decreases. As will be 
noted in later chapters, when the number of treatments is large (approxi- 
mately 20 or more) there are other designs that are more efficient than 
randomized blocks. 

4. Example 11-1. Analysis of a randomized block experiment in which 
the error sum of squares is split into components corresponding to specific 
treatment effects. The data are for numbers of poppies in plots of oats 
subjected to different treatments (Bartlett [1]). 




TABLE 11-1 

NUMBERS OF POPPIES IN OATS 

                              Treatment 
Replicate        1        2        3        4        5      Total 
    1          538      438       77      115       17       1185 
    2          422      442       61       57       31       1013 
    3          377      319      157      100       87       1040 
    4          315      380       52       45       16        808 

Total         1652     1579      347      317      151       4046 

Mean         413.0    394.8     86.8     79.2     37.8 
The data show that (1) and (2) are much higher than the remaining three 
treatments, and also that (5) is lower than (3) and (4). On making a 
simple analysis we have: 


                    SS         DF        MS 
Replicates      14,465.8        3 
Treatments     548,005.2        4     137,001 
Error           35,851.2       12       2,987.6 
Total          598,322.2 


The standard error of a single plot is $\sqrt{2987.6} = 54.66$; therefore, since 
t at the 5% point for 12 degrees of freedom is 2.179, a significant difference 
between the 2 means must be as large as $(54.66/\sqrt{4}) \times \sqrt{2} \times 2.179 = 84.2$. 
It rather surprises us to find that the mean difference between 
treatments (3) and (5) does not approach significance according to this 
test; however, a casual examination of the data indicates that the errors 
are not homogeneous and emphasizes that the error sum of squares 
should be broken down into components appropriate for different 
comparisons. 
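The simple analysis and the significant difference quoted above can be checked with a few lines of code. This is a sketch only: the variable names are illustrative, the counts are the plot values of Table 11-1, and the tabled t of 2.179 is the figure quoted in the text rather than something computed here.

```python
import math

# Poppy counts from Table 11-1: rows are replicates, columns are treatments (1)-(5).
counts = [
    [538, 438, 77, 115, 17],
    [422, 442, 61, 57, 31],
    [377, 319, 157, 100, 87],
    [315, 380, 52, 45, 16],
]
r, c = len(counts), len(counts[0])
grand = sum(map(sum, counts))
cf = grand ** 2 / (r * c)  # correction term G^2 / rc

ss_total = sum(x * x for row in counts for x in row) - cf
ss_reps = sum(sum(row) ** 2 for row in counts) / c - cf
ss_treat = sum(sum(col) ** 2 for col in zip(*counts)) / r - cf
ss_error = ss_total - ss_reps - ss_treat
ms_error = ss_error / ((r - 1) * (c - 1))

t_05 = 2.179  # 5% point of t for 12 degrees of freedom, as quoted in the text
lsd = math.sqrt(ms_error / r) * math.sqrt(2) * t_05

print(round(ss_reps, 1), round(ss_treat, 1), round(ss_error, 1), round(lsd, 1))
```

The difference between the means of treatments (3) and (5), 86.8 - 37.8 = 49.0, falls well short of the value computed for `lsd`, which is the point made in the text.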
The most logical procedure is to break down the error into components 
corresponding to the comparisons: 

(a) (1) − (2)                                  1 DF 
(b) Between (3), (4), (5)                      2 DF 
(c) 3[(1) + (2)] − 2[(3) + (4) + (5)]          1 DF 

    Total                                      4 DF 


using the methods described in Chapter 5, Section 12. In order to do 
this we must list the first and third comparisons for each replicate. The 
second can be worked out by treating the data containing (3), (4), and (5) 
as a separate experiment. 

                    (a)                   (c) 
Replicate       (1) − (2)     3[(1) + (2)] − 2[(3) + (4) + (5)] 
    1              100                   2510 
    2             − 20                   2294 
    3               58                   1400 
    4             − 65                   1859 

Total               73                   8063 
The figures listed under (a) will give us the sum of squares for the treatment 
comparison (1) − (2) and also the corresponding error term. We have 

$$\text{Error term} = \frac{100^2 + 20^2 + 58^2 + 65^2}{2} - \frac{73^2}{8} = 8994.5 - 666.1 = 8328.4$$ 

$$(1) - (2) = \frac{73^2}{8} = 666.1$$ 

where the divisor of the first term is $1^2 + 1^2 = 2$, and that of the second 
term is $4(1^2 + 1^2) = 8$. 

For (c) the sums of squares required are 

$$\text{Error term} = \frac{2510^2 + 2294^2 + 1400^2 + 1859^2}{30} - \frac{8063^2}{120} = 565{,}947.2 - 541{,}766.4 = 24{,}180.8$$ 

$$3[(1) + (2)] - 2[(3) + (4) + (5)] = \frac{8063^2}{120} = 541{,}766.4$$ 

Here, the divisor of the first term is $(3^2 + 3^2 + 2^2 + 2^2 + 2^2) = 30$, and 
that of the second term is 4 × 30 = 120. Treating (3), (4), and (5) as a 
separate experiment, we have 

                     SS        DF 
(3), (4), (5)     5,572.7       2 
Replicates       10,310.2       3 
Error             3,342.0       6 
Total            19,224.9      11 
The complete analysis is: 

                                                                        SE of a 
                                            SS        DF       MS     Single Plot 
(1) − (2)                                  666.1       1       666.1 
Error                                    8,328.4       3     2,776.1      52.6 
(3), (4), (5)                            5,572.7       2     2,786.4 
Error                                    3,342.0       6       557.0      23.6 
3[(1) + (2)] − 2[(3) + (4) + (5)]      541,766.4       1   541,766.4 
Error                                   24,180.8       3     8,060.3      89.8 


The errors for (1) − (2) and for 3[(1) + (2)] − 2[(3) + (4) + (5)] are 
significantly greater than for (3), (4), and (5). This is verified by 
calculating F values as follows. 

$$F = \frac{2776.1}{557.0} = 4.98 \qquad \text{5\% point} = 4.76$$ 

$$F = \frac{8060.3}{557.0} = 14.5 \qquad \text{5\% point} = 4.76$$ 
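The splitting of the error into components can be reproduced mechanically. The sketch below is one way of organizing the arithmetic, not the author's own procedure: `comparison_ss` and `two_way_error_ss` are hypothetical helper names, and the counts are those of Table 11-1.

```python
counts = [
    [538, 438, 77, 115, 17],
    [422, 442, 61, 57, 31],
    [377, 319, 157, 100, 87],
    [315, 380, 52, 45, 16],
]


def comparison_ss(coeffs, rows):
    """Return (comparison SS, error SS) for a single-degree-of-freedom comparison."""
    per_rep = [sum(k * x for k, x in zip(coeffs, row)) for row in rows]
    divisor = sum(k * k for k in coeffs)
    ss_within = sum(v * v for v in per_rep) / divisor
    ss_comp = sum(per_rep) ** 2 / (len(rows) * divisor)
    return ss_comp, ss_within - ss_comp


def two_way_error_ss(rows):
    """Error SS of a replicates-by-treatments table, obtained by difference."""
    r, c = len(rows), len(rows[0])
    cf = sum(map(sum, rows)) ** 2 / (r * c)
    total = sum(x * x for row in rows for x in row) - cf
    reps = sum(sum(row) ** 2 for row in rows) / c - cf
    treat = sum(sum(col) ** 2 for col in zip(*rows)) / r - cf
    return total - reps - treat


ss_a, err_a = comparison_ss([1, -1, 0, 0, 0], counts)        # (a)
ss_c, err_c = comparison_ss([3, 3, -2, -2, -2], counts)      # (c)
err_b = two_way_error_ss([row[2:] for row in counts])        # (b): (3), (4), (5) alone

ms_b = err_b / 6
f_a = (err_a / 3) / ms_b
f_c = (err_c / 3) / ms_b
print(round(err_a, 1), round(err_b, 1), round(err_c, 1), round(f_a, 2), round(f_c, 2))
```

Each error component carries 3, 6, and 3 degrees of freedom respectively, so the two F ratios are both referred to the table at 3 and 6 degrees of freedom.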
This example demonstrates a fundamental principle of the analysis of 
variance with respect to tests of significance. The treatment comparisons 
must be such that they can reasonably be expected to have the same 
variance. In a randomized block experiment where there is evidence of 
heterogeneity of the errors it is frequently possible by means of the 
method illustrated to break up the treatment comparisons into groups 
that can be measured with a fair degree of accuracy. 

5. Missing values in a randomized block experiment. The data from a 
randomized block experiment can be set up in the form of an r × c table, 
where r corresponds to the number of replicates and c to the number of 
treatments. The estimation of missing values and the corresponding 
analysis is that for an r × c table as outlined in detail in Chapter 14, 
Sections 7 and 8. 

6. Latin squares. The Latin square differs from the randomized block 
design in that the treatments are arranged in complete groups in two 
directions, the two classifications being orthogonal to each other and to 
the treatments. Thus, if we have 4 treatments they can be arranged in 
rows and columns as follows: 


ABCD 
DABC 
CDAB 
BCDA 


where each row and each column contains a complete set of the treat- 
ments. In order to take advantage of this arrangement for the control 
of error the groups in the rows and columns must be made to correspond 
with major sources of variation to which the experimental units are subject. 
In field plot experiments, the Latin square is usually laid out in the 
conventional square with the rows and columns corresponding to possible 
fertility trends in two directions across the field. When there is a major 
fertility trend in one direction, it may be more effective to lay out the 
Latin square as illustrated below for a 4 x 4 square. 


Row                1              2              3              4 
Column          1 2 3 4        1 2 3 4        1 2 3 4        1 2 3 4 
Treatment       A B C D        D A B C        C D A B        B C D A 


It will be seen that with this arrangement the row groups will eliminate 
the major fertility trend and the column groups will eliminate fertility 
trends within row groups that are alike. 

In other types of experiments the rows and columns may be made to 
correspond to different sources of error as in an animal feeding experiment 
where the column groups may correspond with initial weight and the row 
groups with age. 

It should be emphasized at this point that the Latin square is not 
suitable when the number of treatments is large. In the first place, there 
must be as many replicates as there are treatments, and if we do not wish 
to go beyond 10 replications it is obvious that we cannot compare more 
than 10 treatments. Furthermore, when we have as many as 10 treat- 
ments it may be difficult to allocate the rows and columns to sources of 
variability in an efficient manner. When the number of treatments is 
small, the Latin square does not provide a sufficient number of replicates 
or sufficient degrees of freedom to give a reliable error, but this can be 
overcome by repeating the square as many times as desired. 

If we let r represent the number of treatments, rows, and columns in a 
Latin square, the form of the analysis is: 


                       SS                                  DF 

Rows           $\sum R^2/r - G^2/r^2$                 $r - 1$               (10) 

Columns        $\sum C^2/r - G^2/r^2$                 $r - 1$               (11) 

Treatments     $\sum T^2/r - G^2/r^2$                 $r - 1$               (12) 

Error          By difference                          $(r - 1)(r - 2)$      (13) 

Total                                                 $r^2 - 1$ 


where R, C, and T represent row, column, and treatment totals, and G 
the grand total. 
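Formulas (10) to (13) translate directly into code. The sketch below is illustrative: `latin_square_anova` is a hypothetical helper, the layout is the 4 × 4 square shown at the beginning of this section, and the yields are artificial values built from purely additive row, column, and treatment effects (not data from the text), so the error sum of squares obtained by difference should vanish.

```python
def latin_square_anova(layout, yields):
    """Sums of squares and degrees of freedom for an r x r Latin square."""
    r = len(layout)
    cf = sum(map(sum, yields)) ** 2 / r ** 2           # G^2 / r^2
    ss = {
        "total": sum(y * y for row in yields for y in row) - cf,
        "rows": sum(sum(row) ** 2 for row in yields) / r - cf,
        "columns": sum(sum(col) ** 2 for col in zip(*yields)) / r - cf,
    }
    totals = {}
    for lay_row, y_row in zip(layout, yields):
        for treatment, y in zip(lay_row, y_row):
            totals[treatment] = totals.get(treatment, 0) + y
    ss["treatments"] = sum(t * t for t in totals.values()) / r - cf
    ss["error"] = ss["total"] - ss["rows"] - ss["columns"] - ss["treatments"]
    df = {"rows": r - 1, "columns": r - 1, "treatments": r - 1,
          "error": (r - 1) * (r - 2), "total": r * r - 1}
    return ss, df


layout = ["ABCD", "DABC", "CDAB", "BCDA"]
# Artificial additive effects, chosen only to exercise the formulas.
row_eff = [0, 1, 2, 3]
col_eff = [0, 2, 4, 6]
treat_eff = {"A": 0, "B": 5, "C": 10, "D": 15}
yields = [[20 + row_eff[i] + col_eff[j] + treat_eff[layout[i][j]]
           for j in range(4)] for i in range(4)]

ss, df = latin_square_anova(layout, yields)
print(ss, df)
```

Because rows, columns, and treatments are mutually orthogonal in a Latin square, the three sums of squares account for the whole of the total when the data are additive and free of error.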

For Latin squares it is frequently of interest to study the efficiency in 
comparison with the corresponding randomized block design in which 
either columns or rows are omitted. The approximate formula for 
estimating the mean square for error is given by Cochran and Cox [4] as 
follows. 

$$E = \frac{n_r E_r + (n_t + n_e)E_e}{n_r + n_t + n_e} \qquad (14)$$ 

where $E_r$ is the mean square for rows, $E_e$ is the mean square for error, 
and $n_r$, $n_t$, and $n_e$ are the degrees of freedom for rows, treatments, and 
error. This formula gives an estimate of the error expected if rows are 
removed. It is obvious that the effect of removing columns could be 
obtained by substituting $E_c$ for $E_r$. 
7. Randomization of a Latin square. The need for randomization 
applies in a Latin square as in all experimental designs. This involves 
placing the treatments at random in positions in the square, subject to the 
restriction that a treatment can occur only once in a row or column. 
The basic principle as stated by Fisher [7] is that "each plot has an equal 
probability of receiving any of the possible treatments, and each pair of 
plots not in the same row or column has the same probability of being 
treated alike." Yates [13] discusses in detail the procedures necessary for 
the randomization of Latin squares from the 3 x 3 to 12 x 12. In 
general, if we have all the possible arrangements for a Latin square of 
given dimensions, the process of randomization involves drawing one of 
these at random. For example, for the 3 x 3 square 
there are only 12 possible arrangements as given below. 


 (1)      (2)      (3)      (4)      (5)      (6) 
 ABC      ACB      BAC      CAB      BCA      CBA 
 BCA      BAC      CBA      ABC      CAB      ACB 
 CAB      CBA      ACB      BCA      ABC      BAC 

 (7)      (8)      (9)      (10)     (11)     (12) 
 ABC      ACB      BAC      CAB      BCA      CBA 
 CAB      CBA      ACB      BCA      ABC      BAC 
 BCA      BAC      CBA      ABC      CAB      ACB 

and if we require a square for actual use it is necessary only to select one 
of these at random. 
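The count of 12 arrangements is easy to confirm by exhaustive search. The following is a minimal sketch (the function name is illustrative) that builds every Latin square row by row and then draws one at random; it is practical only for very small squares.

```python
import random
from itertools import permutations


def latin_squares(symbols):
    """Yield every Latin square on the given symbols by brute-force search."""
    n = len(symbols)
    perms = list(permutations(symbols))

    def extend(rows):
        if len(rows) == n:
            yield tuple(rows)
            return
        for p in perms:
            # A new row is admissible if it repeats no symbol in any column.
            if all(p[j] != row[j] for row in rows for j in range(n)):
                yield from extend(rows + [p])

    yield from extend([])


squares = list(latin_squares("ABC"))
print(len(squares))              # 12
chosen = random.choice(squares)  # one arrangement drawn at random
print(["".join(row) for row in chosen])
```

The same search gives 576 squares for the 4 × 4 case, which is why reduced squares and permutations are used for the larger sizes instead of a full listing.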

With larger squares the problem is more complex. For the 4 x 4 
square Yates [13] gives the following "reduced squares," so called in that the first 
row and column contain A, B, C, D always in the same order, and shows 
that randomization is obtained by selecting a reduced square at random 
and a random permutation of all the columns, and all the rows except 
the first. 

  (1)        (2)        (3)        (4) 

 ABCD       ABCD       ABCD       ABCD 
 BADC       BCDA       BDAC       BADC 
 CDBA       CDAB       CADB       CDAB 
 DCAB       DABC       DCBA       DCBA 
Since we usually represent treatments by numbers, the squares are 
reproduced, replacing A, B, C, D by 1, 2, 3, 4. 

  (1)        (2)        (3)        (4) 

 1234       1234       1234       1234 
 2143       2341       2413       2143 
 3421       3412       3142       3412 
 4312       4123       4321       4321 


and we shall illustrate the procedure of selecting a square at random. 
Opening a table of random numbers, Table A-7, we decide arbitrarily 
before looking at the table to take the first 2-digit number in, say, the 
sixth column and twentieth row. Suppose that this is 86, and dividing 
by 4 gives a remainder of 2. This indicates that our design is to be made 
up from the second square. The next step is to draw a random permuta- 
tion of the numbers 1, 2, 3, 4. Starting at random in the table and 
proceeding systematically, we put down the numbers 1, 2, 3, 4 as they 
occur. Suppose we get 4, 3, 1, 2; then square (2) is written down with 
the columns in that order. 


4312 
1423 
2134 
3241 


For the next step a random permutation of the numbers 1, 2, 3 is required. 
We get 2, 3, 1, and these decide the arrangement of the last 3 rows. The 
final square is 

4312 

2134 

3241 

1423 
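The steps just described can be sketched in code. The function names below are illustrative, the four reduced squares are those listed above, and the use of Python's `random` module in place of a table of random numbers is an assumption of this sketch. The first call reproduces the worked example: square (2), column order 4, 3, 1, 2, and row order 2, 3, 1 for the last three rows.

```python
import random

REDUCED_4X4 = {
    1: ["1234", "2143", "3421", "4312"],
    2: ["1234", "2341", "3412", "4123"],
    3: ["1234", "2413", "3142", "4321"],
    4: ["1234", "2143", "3412", "4321"],
}


def randomize_4x4(square_no, column_order, row_order):
    """Permute all columns, then the last three rows, of a reduced square."""
    square = REDUCED_4X4[square_no]
    # The k-th new column is the original column numbered column_order[k].
    permuted = ["".join(row[c - 1] for c in column_order) for row in square]
    tail = permuted[1:]  # the first row stays in place
    return [permuted[0]] + [tail[i - 1] for i in row_order]


def random_4x4(rng=random):
    """Draw the square number and both permutations at random."""
    return randomize_4x4(rng.randint(1, 4),
                         rng.sample([1, 2, 3, 4], 4),
                         rng.sample([1, 2, 3], 3))


final = randomize_4x4(2, [4, 3, 1, 2], [2, 3, 1])
print(final)  # ['4312', '2134', '3241', '1423']
```

When the square is chosen from the remainder after dividing a 2-digit number by 4, some convention is needed for a remainder of 0; taking it to mean square (4) is one natural choice, though the text does not state it.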


For the 5 x 5 squares Yates gives 2 reduced squares as follows. 

 12345        12345 
 21453        23451 
 35124        34512 
 43512        45123 
 54231        51234 

 (1-50)      (51-56) 
Note the numbers below the squares. In order to select a square at 
random we take a number at random from 1 to 56. If the number is 
between 1 and 50, inclusive, we use the first square. If the number is 
between 51 and 56, inclusive, we use the second square. 

After selecting a square, randomization is completed by permuting all 
columns, all rows, and all numbers. Suppose that the first square is 
selected, then from a table of random numbers we select 3 random 
permutations of the numbers 1, 2, 3, 4, 5. Let these be (1, 5, 4, 2, 3), 
(4, 5, 2, 1, 3) (3, 1, 5, 4, 2). Permuting the columns gives 


15423 
23514 
34251 
42135 
51342 


and permuting the rows gives 


42135 
51342 
23514 
15423 
34251 


To permute the numbers it is convenient to write down 


12345 
31542 


the second row being our random permutation. This shows that 1 is 
changed to 3, 2 to 1, etc. Our final square is 


41352 
23541 
15234 
32415 
54123 
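The three permutations can be applied in code to confirm the result. This is a sketch with an illustrative function name; the square is the first 5 × 5 reduced square given above, and the permutations are the ones used in the worked example.

```python
FIRST_5X5 = ["12345", "21453", "35124", "43512", "54231"]


def permute_square(square, col_perm, row_perm, num_perm):
    """Apply a column permutation, a row permutation, and a relabeling of numbers."""
    cols = ["".join(row[c - 1] for c in col_perm) for row in square]
    rows = [cols[r - 1] for r in row_perm]
    # num_perm[i] is the number that replaces i + 1 (so 1 -> 3, 2 -> 1, ...).
    relabel = {str(i + 1): str(new) for i, new in enumerate(num_perm)}
    return ["".join(relabel[ch] for ch in row) for row in rows]


final = permute_square(FIRST_5X5,
                       col_perm=(1, 5, 4, 2, 3),
                       row_perm=(4, 5, 2, 1, 3),
                       num_perm=(3, 1, 5, 4, 2))
print(final)  # ['41352', '23541', '15234', '32415', '54123']
```

A useful check on any square produced this way is that every number occurs exactly once in each row and each column.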


The 6 x 6 squares have been dealt with in the same manner, and the 
appropriate squares can be found in Yates [13] and Fisher and Yates [9]. 
The higher squares have not been treated in detail, but Yates points out 
that, if the following squares are randomized by selecting a random 
permutation of the rows, columns, and numbers, a sufficient number of 
the possible squares are represented for all practical purposes. 


 1234567          12345678 
 2456173          23156487 
 3765214          31478562 
 4512736          46731825 
 5327641          58267314 
 6173452          64812753 
 7641325          75683241 
                  87524136 

 (7 x 7)          (8 x 8) 

 123456789        1234567890 
 235749618        2715836904 
 346187952        3807625149 
 481265397        4179053268 
 572938461        5680971423 
 698524173        6523490781 
 769312845        7962148035 
 857691234        8396704512 
 914873526        9041382657 
                  0458219376 

 (9 x 9)          (10 x 10) 


The 11 x 11 and 12 x 12 squares are also given by Fisher and Yates [9]. 

It is well to keep in mind that, regardless of the availability of squares 
for which we can take random permutations of the rows, columns, and 
numbers, we can always select a square at random by trial and error. 
This involves trying out various random permutations and selecting those 
that will fit. For example, if we need a 6 x 6 square, we can take any 
random permutation for the first row. Suppose that we get 


326451 


Then in selecting the second permutation only those numbers are taken 
that do not occur in the same column. While setting up the fourth and 
fifth rows it may be found that only certain combinations of numbers will 
fit, and these are filled in without reference to the table of random 
numbers. The sixth row is of course fixed when the fifth is complete. 
Occasionally, combinations are obtained that will not work, and it is then 
necessary to eliminate one or more rows and start again. 
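The trial-and-error method lends itself to a short program. The sketch below is one possible reading of the procedure, not a transcription of it: each row is filled from left to right with numbers not yet used in that row or column, and on reaching a dead end the whole square is discarded and begun again (the text allows discarding only some rows). The function names are illustrative, and no claim is made that every square is produced with equal probability.

```python
import random


def try_row(square, symbols, rng):
    """Fill one row at random; return None if a column leaves no admissible number."""
    row = []
    for j in range(len(symbols)):
        used = {r[j] for r in square} | set(row)
        free = [s for s in symbols if s not in used]
        if not free:
            return None
        row.append(rng.choice(free))
    return row


def random_latin_square(n, rng=random, max_restarts=10000):
    """Build an n x n Latin square by trial and error, restarting on dead ends."""
    symbols = list(range(1, n + 1))
    for _ in range(max_restarts):
        square = []
        while len(square) < n:
            row = try_row(square, symbols, rng)
            if row is None:
                break
            square.append(row)
        if len(square) == n:
            return square
    raise RuntimeError("no square found; increase max_restarts")


def is_latin(square):
    """True if every row and column holds each of 1..n exactly once."""
    full = set(range(1, len(square) + 1))
    return (all(set(row) == full for row in square)
            and all(set(col) == full for col in zip(*square)))


square = random_latin_square(6)
for row in square:
    print("".join(map(str, row)))
```

As the text notes, the last row is fixed once the others are complete; in the sketch this shows up as each column of the final row having exactly one admissible number.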
8. Missing values in a Latin square. The formula for the estimation of 
a missing value as developed in Chapter 14, Section 19, is 

$$\frac{r(R + C + T) - 2G}{(r - 1)(r - 2)} \qquad (15)$$ 

where R, C, T, and G are the row total, column total, treatment total, and 
grand total containing the missing value. For comparing a mean 
containing an estimated value with a mean for a treatment for which the 
results are complete, the approximate standard error is 

$$\sqrt{s^2\left[\frac{2}{r} + \frac{1}{(r - 1)(r - 2)}\right]}$$ 

where $s^2$ is the error mean square. 
Corrections to the treatment mean square for an exact test of significance 
can be worked out by the methods described in Chapter 14, Section 10. 
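The missing-value formula can be tried on a small artificial case. In the sketch below the function name is illustrative and the yields are made-up additive values (not data from the text); one plot is treated as lost, the totals R, C, T, and G are formed from the remaining plots, and the estimate is compared with the value that was removed.

```python
def missing_value(r, row_total, col_total, treat_total, grand_total):
    """Estimate a single missing plot in an r x r Latin square."""
    return (r * (row_total + col_total + treat_total) - 2 * grand_total) / ((r - 1) * (r - 2))


layout = ["ABCD", "DABC", "CDAB", "BCDA"]
# Artificial additive effects, used only to exercise the formula.
row_eff = [0, 1, 2, 3]
col_eff = [0, 2, 4, 6]
treat_eff = {"A": 0, "B": 5, "C": 10, "D": 15}
yields = [[20 + row_eff[i] + col_eff[j] + treat_eff[layout[i][j]]
           for j in range(4)] for i in range(4)]

i0, j0 = 1, 2                      # the plot assumed to be missing
true_value = yields[i0][j0]
t0 = layout[i0][j0]

# Totals of the row, column, and treatment containing the missing plot,
# and the grand total, all formed without the missing value.
R = sum(yields[i0][j] for j in range(4) if j != j0)
C = sum(yields[i][j0] for i in range(4) if i != i0)
T = sum(yields[i][j] for i in range(4) for j in range(4)
        if layout[i][j] == t0 and (i, j) != (i0, j0))
G = sum(map(sum, yields)) - true_value

estimate = missing_value(4, R, C, T, G)
print(estimate, true_value)
```

With error-free additive data the estimate recovers the removed value exactly; with real data it gives the value that minimizes the error sum of squares.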
For testing the significance of mean differences between treatments 
containing missing plots, apply the rule given by Yates [12] that the 
effective replication is as follows, assuming that the treatments are A and B. 
For each value of A determine 

B present in row and column, replications = 1 
B present in a row or column, replications = 2/3 
B missing, replications = 1/3 
A missing, replications = 0 


Similarly calculate the effective replication for B based on the presence or 
absence of A in corresponding rows and columns. For example, in Table 
11-6, comparing rations 1.9 and 6.4 and beginning with the first row, 


we have E 
1.9 О+1+0+1+1+1= 4 


64 Ф+#+#+1+1+1=4 


Then the standard error of a mean difference between these two 
rations is 

$$\sqrt{s^2\left(\frac{1}{4} + \frac{1}{4}\right)}$$ 


Yates [15] and Yates and Hale [16] have developed methods of analysis 
for the situations where one or more entire rows, columns, or treatments 
are missing. 

9. The Graeco-Latin square. This design is of interest chiefly from the 
theoretical standpoint, although Dunlop [6] has pointed out how it could 
be applied in an actual experiment. Fisher [7] states: "The principal use 
of Graeco-Latin and higher squares consists in clarifying complex 
combinatorial situations * * *." This is particularly true of the designs 
referred to later under the heading of Incomplete Block Experiments. 

In Latin squares it is possible to attach Greek letters to the Latin letters 
in such a way that both Greek letters and Latin letters are orthogonal to 
each other and to the rows and columns. Fisher [7] gives the following 
3 x 3 Graeco-Latin square 


Aα  Bβ  Cγ 
Bγ  Cα  Aβ 
Cβ  Aγ  Bα 


and it will be noted that A has attached to it the three Greek letters, 
α, β, γ. The same is true for B and C. Thus, if α, β, and γ represent 
some effect in the experiment, this effect will be orthogonal to the 
treatments represented by A, B, and C. Similarly, we note that the Greek 
letters, like the Latin letters, occur once in each row and once in each 
column. 
In a 3 x 3 Latin square the degrees of freedom can be apportioned as 
follows. 

Rows            2 
Columns         2 
Treatments      2 
Error           2 

and obviously the 2 degrees of freedom for error can be identified with 
the 2 for Greek letters in a Graeco-Latin square. Therefore, a 3 x 3 
Graeco-Latin square would not be a possible design for an actual 
experiment in that there would be no degrees of freedom for estimating the 
error, unless we use 2 or more squares. In a 4 x 4 Latin square the 
degrees of freedom may be partitioned as follows. 

Rows            3 
Columns         3 
Treatments      3 
Error           6 

and for a Graeco-Latin square the outline is 

Rows            3 
Columns         3 
Treatments      3 
Greek letters   3 
Error           3 
This opens up the possibility of a continuation of the process, for example, 
by using numbers as subscripts to the Greek letters, giving finally 


Rows 3 
Columns 3 
Latin letters 3 
Greek letters 3 
Numbers 3 


Fisher [7] gives the following square as an example. 

Aα₁  Bβ₂  Cγ₃  Dδ₄ 
Bγ₄  Aδ₃  Dα₂  Cβ₁ 
Cδ₂  Dγ₁  Aβ₄  Bα₃ 
Dβ₃  Cα₄  Bδ₁  Aγ₂ 


The principle is suggested by these squares that a square of r dimensions 
can be divided into r + 1 mutually orthogonal sets of r— 1 degrees of 
freedom, but unfortunately this is not true as exemplified particularly by 
the 6 x 6 square for which even a Graeco-Latin square does not exist. 
If the side of the square is a prime number or any power of a prime 
number, however, the $r^2 - 1$ degrees of freedom for the r x r square can 
be divided into r + 1 orthogonal sets of r − 1 degrees of freedom. This 
is a particularly important proposition in connection with the principles 
underlying incomplete block experiments. Further, for $r^2$ objects 
arranged in an r x r square, where r is a prime number, the division can 
be made by a simple mechanical procedure. Thus in a 3 x 3 square the 
9 objects can be represented by the numbers 


11 12 13 
21 22 23 
31 32 33 


The rows and columns give the first two divisions into orthogonal sets. 
The third can be taken from the diagonals, thus: 


11 22 33 
21 32 13 
31 12 23 


and the fourth from the diagonals of the last square as follows. 


11 32 23 
21 12 33 
31 22 13 


If we write a fifth set, the original square is regenerated, showing that all 
possible sets have been written. 
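For a prime side the mechanical procedure amounts to grouping the cells by row, by column, and along each family of diagonals. The sketch below is an illustration under that reading (the function names and the modular-arithmetic formulation are not from the text): for a p x p square it forms the p + 1 groupings and checks that any group of one grouping meets any group of another in exactly one cell.

```python
def orthogonal_sets(p):
    """Split the p*p cells into p + 1 groupings of p groups each (intended for prime p)."""
    keys = [lambda i, j: i, lambda i, j: j]                            # rows, columns
    keys += [lambda i, j, k=k: (j + k * i) % p for k in range(1, p)]   # diagonal families
    groupings = []
    for key in keys:
        groups = [[] for _ in range(p)]
        for i in range(p):
            for j in range(p):
                groups[key(i, j)].append((i + 1, j + 1))  # cell "11" is (1, 1), etc.
        groupings.append(groups)
    return groupings


def mutually_orthogonal(groupings):
    """True when any two groups from different groupings share exactly one cell."""
    for a in range(len(groupings)):
        for b in range(a + 1, len(groupings)):
            for g in groupings[a]:
                for h in groupings[b]:
                    if len(set(g) & set(h)) != 1:
                        return False
    return True


sets3 = orthogonal_sets(3)
for groups in sets3:
    print(groups)
print(mutually_orthogonal(sets3))
```

For the 3 x 3 case the two diagonal groupings contain the groups 11, 22, 33 and 11, 32, 23 written out above. The same arithmetic fails for a side of 4, which is consistent with the remark that squares whose side is not a prime need other methods.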
If the square is not that of a prime number or a power of a prime 
number, there is considerable difficulty in writing out the orthogonal sets, 
except for the 4 x 4, which can be done simply by trial and error. How- 
ever, Fisher and Yates [9] have given completely orthogonalized 8 x 8 
and 9 x 9 squares. These squares can also be found in Cochran and 
Cox [4]. The higher squares have not been investigated. 

10. The cross-over design. This design may be regarded as a variant 
of the Latin square. If there are 2 treatments A and B in 10 replicates, 
the layout may be as follows. 

Column 


The pairs of treatments are at random subject to the restriction that A 
and B must occur an equal number of times in each row. The corres- 
pondence with the Latin square is seen by comparison of the above with 
five 2 x 2 Latin squares. 


Row 1 2 3 4 5 
1 A B A B B A A B B A 
2 B A B A A B B A A B 


Comparing the analyses of these two designs, we have: 


Cross-Over Design Latin Squares 
DF DF 
Treatments 1 Treatments 1 
Rows 1 Squares 4 
Columns 9 Rows 5 
Error 8 Columns 5 
Error 4 
Total 19 Total 19 


and it is clear that the cross-over design may have an advantage in that 
more degrees of freedom are available for the estimation of error. Its 
particular value, however, would seem to be under conditions in which a 
consistent difference is known to exist between the rows throughout all 
the pairs. In biological assay work the rows may be 2 animals that are 
treated consecutively as indicated by the columns, the latter representing, 
therefore, the time element in the experiment. For further details on this 
design, see Cochran and Cox [4]. 
11. Efficiency in experimental design. In order to compare 2 
mental designs or even to think in terms of the comparison of designs 
we require some sort of basis of comparison. Fisher [7] has discussed 
this from the standpoint of the relative amount of information furnished 
by a given experiment and suggests that the inverse of the error variance 
be taken as a general measure. Thus, if one design gives an error variance 
of 2 and another design a variance of 1, the relative amounts of informa- 
tion given by the 2 designs are 1/2 and 1. Although there are other ways 
of setting up such standards this is a very practical method because it can 
be interpreted directly in terms of the amount of replication required in 
order to obtain a given degree of accuracy. Since the variance of a mean 
is estimated by E/r, where E is the error mean square and r the number 
of replications, it follows that for the 2 designs referred to above the former 
would require twice as many replications as the latter to give equal 
precision. 

If we have 2 designs, one giving an error mean square of $E_1$ and the 
other $E_2$, a measure of the relative efficiency of the second design as compared 
to the first is given by the ratio $E_1/E_2$. Suppose that this equals 2, 
and the second experiment has 6 replications. The first experiment 
would require 12 replications to obtain the same degree of accuracy. In 
general 

$$\frac{E_1}{E_2}\, r_2$$ 

where $r_2$ is the number of replications in experiment 2, gives the number 
of replications required for experiment 1 in order to attain equal precision. 

In comparing 2 designs for the same experiment it is usual to have to 
take into consideration a difference in the degrees of freedom available 
for the estimation of error. It will have been noted that error control by 
means of blocks, rows, columns, etc., involves corresponding reductions 
in the degrees of freedom for error. Thus, in a completely randomized 
experiment for 8 treatments with each treatment tested in 3 experimental 
units, there are 16 degrees of freedom for error. The same experiment set 
up in randomized blocks would have 14 degrees of freedom for error. 
On looking up the table of t at the 5% point, we find that the corresponding 
values of $t_{.05}$ are 2.12 and 2.14; therefore, the randomized block error 
will have to be proportionately smaller in order to make up for this 
difference. This whole question is not quite so simple as it may appear 
at first, because in comparing t values at the 5% point we are considering 
only one level of significance. At the 1% level we have $t_{.01} = 2.92$ for 
n = 16, and $t_{.01} = 2.98$ for n = 14, and the ratio of these t values is not 
quite the same as at the 5% point. Cochran and Cox [4] discuss this 
point in some detail, giving a table showing the effect of the levels of 
significance on the relative efficiencies of experiments with different 
numbers of degrees of freedom for error. Fisher [7] has developed a 
formula that enables us to make a comparison of the efficiency of 2 
experiments and to take into account the effect of the degrees of freedom 
throughout the full range of t. For 2 experiments with error mean 
squares $E_1$ and $E_2$ and degrees of freedom $n_1$ and $n_2$ we can take 

$$\frac{(n_2 + 1)(n_1 + 3)E_1}{(n_1 + 1)(n_2 + 3)E_2} \qquad (18)$$ 

as an expression for the relative efficiency of the second experiment as 
compared to the first. The first part of the formula makes a correction 
for the difference in degrees of freedom. This formula is applied in 
Examples 11-2 and 11-3. 
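Both the replication rule and Fisher's formula are simple enough to express as functions. The names below are illustrative, not from the text; the numerical check uses the figures of Example 11-2, where the ungrouped experiment (error 15.67 on 18 DF) is experiment 1 and the randomized blocks (error 9.655 on 15 DF) is experiment 2.

```python
def replications_for_equal_precision(e1, e2, r2):
    """Replications experiment 1 needs to match experiment 2 with r2 replications."""
    return e1 / e2 * r2


def fisher_df_factor(n1, n2):
    """Correction for the difference in error degrees of freedom."""
    return (n2 + 1) * (n1 + 3) / ((n1 + 1) * (n2 + 3))


def relative_efficiency(e1, n1, e2, n2):
    """Efficiency of experiment 2 relative to experiment 1, as in formula (18)."""
    return fisher_df_factor(n1, n2) * e1 / e2


print(replications_for_equal_precision(2, 1, 6))             # 12.0
print(round(fisher_df_factor(18, 15), 3))                    # 0.982
print(round(relative_efficiency(15.67, 18, 9.655, 15), 2))   # 1.59
```

The correction factor is below 1 whenever the second experiment has fewer degrees of freedom for error, so it reduces the apparent gain from error control, though here only slightly.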

12. Example 11-2. A randomized block experiment. Table 11-2 gives 
the yields of 6 wheat varieties in an experiment in 4 randomized blocks. 


TABLE 11-2 

YIELDS OF WHEAT VARIETIES 

                                 Variety                              Block 
Block             A       B       C       D       E       F          Total 
  1             27.8    30.6    27.7    16.2    16.2    24.9         143.4 
  2             27.3    28.8    22.7    15.0    17.0    22.5         133.3 
  3             28.5    31.0    34.9    14.1    17.7    22.7         148.9 
  4             38.5    39.5    36.8    19.6    15.4    26.3         176.1 

Variety total  122.1   129.9   122.1    64.9    66.3    96.4         601.7 

The nature of the yields in this experiment gives some indication of 
heterogeneity with respect to the errors. For example, the ranges in yield 
for the 6 varieties in order are 11.2, 10.7, 14.1, 5.5, 2.3, 3.8; therefore, it 
is possible that error components should be isolated. However, this is 
left to be done by the student and is given as one of the exercises at the 
end of this chapter. 

Calculating the sums of squares we get: 


Total     = 16,460.05 − 15,085.12 = 1,374.93 
Blocks    = 15,252.48 − 15,085.12 =   167.36 
Varieties = 16,147.87 − 15,085.12 = 1,062.75 
Error     = 1,374.93 − 167.36 − 1,062.75 = 144.82 
We then have the analysis of variance: 


                SS        DF       MS        F      5% Point 
Blocks        167.36       3      55.79     5.78      3.29 
Varieties   1,062.75       5     212.55    22.0       2.90 
Error         144.82      15       9.655 
Total       1,374.93      23 


It is of interest to test the significance of the block effect in order to note 
the extent to which error control has been realized. 

In order to test the significance of the differences between randomly 
selected pairs of variety means, we can determine the minimum significant 
difference as follows: 
Since $t_{0.05} = 2.13$ for 15 degrees of freedom, we have 

$$2.13 \times \sqrt{\frac{2 \times 9.655}{4}} = 4.68$$

Or we may use the t test in the conventional manner, calculating first the 
standard error of a mean difference from 

$$\sqrt{\frac{2 \times 9.655}{4}} = 2.20$$

Then if we wish to compare varieties D and F for which the means are 
16.2 and 24.1, we have 

$$t = \frac{24.1 - 16.2}{2.20} = 3.59$$

and this difference is significant since the 5% point of t for 15 degrees of 
freedom is 2.13. 
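Both routes to the comparison can be written in a few lines. The sketch below uses hypothetical function names; the error mean square, the number of blocks, the two variety means and the tabulated 5% point of t are the values quoted in this example.

```python
import math


def se_mean_difference(error_ms, replicates):
    """Standard error of the difference between two treatment means."""
    return math.sqrt(2.0 * error_ms / replicates)


def minimum_significant_difference(t_value, error_ms, replicates):
    """Smallest difference between two means that reaches the given t."""
    return t_value * se_mean_difference(error_ms, replicates)


# Error mean square 9.655 on 15 DF, 4 blocks, tabulated 5% point of t.
T_05 = 2.13
msd = minimum_significant_difference(T_05, 9.655, 4)
t_d_vs_f = (24.1 - 16.2) / se_mean_difference(9.655, 4)

print(round(msd, 2))
print(round(t_d_vs_f, 2), t_d_vs_f > T_05)
```

The t computed here keeps the unrounded standard error, so it can differ in the last digit from a value obtained after rounding the standard error to two decimals.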

The relative efficiency of this randomized block experiment as compared 
to an ungrouped experiment can be determined by first applying the 
formula given by Cochran and Cox [4] in order to obtain an estimate of 
the error of the ungrouped experiment. Here we have 


$$E_{CR} = \frac{3 \times 55.79 + 20 \times 9.655}{23} = 15.67$$

The relative efficiency is given approximately by 

$$\frac{15.67}{9.655} = 1.62$$

This is reduced slightly due to the fact that in the completely randomized 
experiment there are 18 degrees of freedom for error as compared to 15 
for the randomized blocks, but this effect is small. We can estimate this 
effect by using formula (18) described in Section 11. It is 

$$\frac{(15 + 1)(18 + 3)}{(18 + 1)(15 + 3)} = \frac{336}{342} = 0.982$$


The corrected relative efficiency is then 


1.62 x 0.982 = 1.59 
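The steps of this efficiency calculation can be collected as follows. The function names are assumptions made for the sketch; the weighting of the error mean square by the treatment plus error degrees of freedom is read off the worked figures above (3, 20 and 23).

```python
def ungrouped_error_ms(group_ms, group_df, error_ms, treatment_df, error_df):
    """Estimate the error mean square had the grouping not been used.

    The grouping mean square is pooled with the error mean square, the
    latter weighted by the treatment plus error degrees of freedom.
    """
    weight = treatment_df + error_df
    return (group_df * group_ms + weight * error_ms) / (group_df + weight)


def df_correction(n_actual, n_alternative):
    """Degrees-of-freedom factor taken from formula (18)."""
    return ((n_actual + 1) * (n_alternative + 3)) / (
        (n_alternative + 1) * (n_actual + 3)
    )


# Example 11-2: blocks 55.79 on 3 DF, error 9.655 on 15 DF, varieties 5 DF.
e_cr = ungrouped_error_ms(55.79, 3, 9.655, 5, 15)
efficiency = e_cr / 9.655
corrected = efficiency * df_correction(15, 18)

print(round(e_cr, 2), round(efficiency, 2), round(corrected, 2))
```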


13. Example 11-3. Analysis of a Latin square experiment. The plan 
below is for a Latin square experiment to test the efficiency of methods 
of dusting with sulphur in order to control stem rust of wheat. The key 
to the treatments is given with the plan. 


[Plan of the 5 x 5 Latin square: rows 1 to 5 by columns 1 to 5, one 
treatment letter per plot.] 

Key to Treatments 
A = Dusted before rains 
B = Dusted after rains 
C = Dusted once each week 
D = Drifting, once each week 
E = Not dusted 


All applications were 30 pounds to the acre at each treatment. Drifting 
means that the dust was allowed to settle over the plants from above, as 
in airplane dusting. The plot yields in bushels per acre are given in 
Table 11-3, where the figures in the table correspond to the position of 
the plots and treatments in the plan. 


TABLE 11-3 

YIELDS IN A LATIN SQUARE EXPERIMENT 

                           Column                    Row     Treatment   Treatment 
                 1      2      3      4      5      Total      Total       Mean 
Row 1           4.9    6.4    3.3    9.5   11.8      35.9     A  34.2       6.84 
    2           9.3    4.0    6.2    5.1    5.4      30.0     B  32.3       6.46 
    3           7.6   15.4    6.5    6.0    4.6      40.1     C  65.6      13.12 
    4           5.3    7.6   13.2    8.6    4.9      39.6     D  39.8       7.96 
    5           9.3    6.3   11.8   15.9    7.6      50.9     E  24.6       4.92 

Column Total   36.4   39.7   41.0   45.1   34.3     196.5       196.5 

The calculation of the sums of squares is given below. 


Total      = 1829.83 - 1544.49 = 285.34 
Rows       = 1591.16 - 1544.49 =  46.67 
Columns    = 1558.51 - 1544.49 =  14.02 
Treatments = 1741.10 - 1544.49 = 196.61 
Error      = 285.34 - 46.67 - 14.02 - 196.61 = 28.04 
Then the analysis of variance is: 

SS DF MS F 5% Point 
Rows 46.67 4 11.67 4.99 3.26 
Columns 14.02 4 3.505 1.50 3.26 
Treatments 196.61 4 49.15 21.0 3.26 
Error 28.04 12 2.337 

Total 285.34 24 
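The same analysis can be retraced from the plot yields and the treatment totals printed with Table 11-3. This is a sketch with names chosen here; the treatment totals are entered directly because only the totals, not the positions of the treatments, are needed for the sum of squares.

```python
# Plot yields from Table 11-3 (rows by columns) and the treatment totals
# printed beside the table.
yields = [
    [4.9, 6.4, 3.3, 9.5, 11.8],
    [9.3, 4.0, 6.2, 5.1, 5.4],
    [7.6, 15.4, 6.5, 6.0, 4.6],
    [5.3, 7.6, 13.2, 8.6, 4.9],
    [9.3, 6.3, 11.8, 15.9, 7.6],
]
treatment_totals = {"A": 34.2, "B": 32.3, "C": 65.6, "D": 39.8, "E": 24.6}

k = len(yields)
grand = sum(map(sum, yields))
correction = grand ** 2 / k ** 2


def ss_from_totals(totals):
    """Sum of squares for a classification whose totals each cover k plots."""
    return sum(t * t for t in totals) / k - correction


ss = {
    "rows": ss_from_totals([sum(row) for row in yields]),
    "columns": ss_from_totals([sum(col) for col in zip(*yields)]),
    "treatments": ss_from_totals(treatment_totals.values()),
}
ss_total = sum(x * x for row in yields for x in row) - correction
ss["error"] = ss_total - sum(ss.values())

df = {"rows": k - 1, "columns": k - 1, "treatments": k - 1,
      "error": (k - 1) * (k - 2)}
ms = {name: ss[name] / df[name] for name in ss}
for name in ss:
    print(f"{name:<11}{ss[name]:8.2f}{df[name]:4d}{ms[name]:8.3f}")
```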


The column mean square is not significant. This is probably due to the 
shape of the plots. They were long and narrow; hence the columns are 
narrow strips running the length of the rectangular area. Under these 
conditions the Latin square may have little advantage on the average over 
a randomized block plan. 

To compare means of pairs of treatments we look up $t_{0.05}$ for 12 degrees 
of freedom = 2.18; then a significant difference between 2 means must 
be equal to 

$$2.18 \times \sqrt{\frac{2 \times 2.337}{5}} = 2.11$$


and on examining the means we see that the significance of the treatment 
mean square is due chiefly to treatment C which is much more effective 
than any of the others. 

Since the columns in this experiment contribute less to the control of 
error than the rows, it is of interest to apply the formula given by Cochran 
and Cox [4] to estimate the error mean square that would have been 
obtained if this had been a randomized block experiment with the rows 
as replicates. 

$$E_{RB} = \frac{4 \times 3.505 + 16 \times 2.337}{20} = 2.571$$

The relative efficiency is measured by 2.571/2.337 = 1.10. The effect of 
the reduction in degrees of freedom can be estimated from Fisher's 
formula as given in Section 11. We have 

$$\frac{(12 + 1)(16 + 3)}{(16 + 1)(12 + 3)} = 0.9686$$

and 0.9686 x 1.10 = 1.07, or the gain is actually only 7%. 


14. Exercises. 


1. Make up designs showing the complete randomized layout for the following types 
of experiments. 

a. Randomized blocks, 10 treatments in 4 replications. 

b. Latin square (7 x 7). 

c. Cross-over design, 3 treatments and 9 columns. Compare the outline of the 
analysis obtained for this experiment with that of three 3 x 3 Latin squares. 

2. Analyze the randomized block experiment data given in Table 11-4 for a feeding 
trial conducted by Crampton and Hopkins [5]. Determine the standard error of a 
difference between 2 randomly selected means, and make a t test of the difference 
between the means of any 2 treatments. Error mean square = 243.6. 

TABLE 11-4 

GAINS IN WEIGHT FOR FEEDING TRIAL (POUNDS - 100) 
Lot 


1 2 3 4 5 
65 68 64 85 101 
56 80 26. 95 89 
59 80 89 86 73 
67 66 38 101 93 
70 70 53 65 64 
46 6l 9015/75 60 
3077 71 60 87 100 
51 бе ОТ T1942 
64 79 2 66 84 
10258 91 55 65 49 


Rows: replicates 1 to 10 


3. Make an analysis of variance for the 5 x 5 Latin square experiment given in 
Table 11-5. Use formula (14) to test the error control effect due to rows and columns 
separately. Error mean square = 399.35. 

4. Take the data of Table 11-2 for a randomized block experiment and divide the 
error sum of squares into components representing: 


a   A, B, C                       2 DF 
b   D, E                          1 DF 
c   D + E - 2F                    1 DF 
d   (A + B + C) - (D + E + F)     1 DF 

and test each component against the corresponding error. 
Error mean squares: a = 6.670; b = 5.687; c = 0.990; d = 28.26. 

TABLE 11-5 

YIELD OF POTATOES (POUNDS - 250) FOR LATIN SQUARE FERTILITY EXPERIMENT, 
ROTHAMSTED REPORT, 1927-1928 


130 120 118 4395137) 2C IS е) eo 25 
70 81 104 84. 72 15 1C 720 28 0 
120 109 18 82 71 25/7 267 70 1C- 186 
42 97 72 97 48 -0 25 15 2C. 4€ 
48 32 72 71 45 1Cy 70 28 IS 2С 


Treatments: 0  = Control 
            1C = Cyanamide (equivalent to 1 cwt sulphate of ammonia) 
            2C = Cyanamide (double dressing) 
            1S = Sulphate of ammonia (1 cwt) 
            2S = Sulphate of ammonia (double dressing) 

CHAPTER 12 


Factorial Experiments 


1. Principle of factorial experimentation. As pointed out in Chapter 
11, Section 1, it is necessary for an experiment to have sufficient scope in 
order to obtain reliable information on the questions asked. Suppose 
that we are planning an experiment to test varieties of oats for an area 
where oats are grown on summer fallow as well as a second crop in the 
rotation. Under dry conditions the second crop growth may be less 
favorable than on fallow, and varieties may behave differently. An 
obvious procedure to overcome this difficulty is to have one-half the plots 
on fallow and one-half on stubble land. A further possibility is that the 
optimum rates of seeding will be different for the two conditions, and 
this may be different for the varieties. This calls for a variation in the 
rate of seeding. We have, then, 3 factors: (a) variety, (b) cultural 
practice, and (c) rate of seeding; and at this point our problem is to 
decide on the detailed design of the experiment. One might try to test 
each of the factors separately, holding all other factors constant in a 
given experiment, but with a little thought it will be clear that such an 
experiment may not give the information required. For example, it may 
be that only one variety is to be recommended; therefore we want to 
know which variety will give the best results under all conditions. The 
logical procedure is to vary all factors simultaneously, that is, within the 
framework of the same experiment. When we do so, we have what is 
now widely known as a factorial experiment. 

One of the advantages of the factorial experiment therefore is that it 
increases the scope of the experiment and its inductive value, and it does 
so mainly by giving information not only on the main factors but on their 
interactions. A further advantage is that the various levels of 1 factor 
constitute replications of the other factors and increase the amount of 
information obtained on all factors. For example, if we have 6 varieties 
in a test, each at 3 rates of seeding, in 1 replication there will be 3 plots 
of each variety, 1 for each rate, and in 4 replications there will be a total 
of 12 plots. Similarly for any 1 rate of seeding there are 6 plots in a 
replicate and a total of 24 plots in all. It is for this reason that replication 
can be kept to the minimum, and with a sufficient number of factors and 
levels it is possible to design a quite satisfactory experiment with one 
replicate. 

The introduction of factors and levels in an experiment is of course 
limited by space and cost. If we study all possible combinations in an 
experiment with 4 factors, each at 3 levels, we have $3^4 = 81$ combinations. 
The addition of another factor at 3 levels brings the total to 243, at which 
point the experiment becomes bulky, and since the blocks would be quite 
large they would not operate efficiently in error control. However, the 
device known as confounding can be utilized to increase the efficiency of 
error control, and this will be discussed in following sections. Furthermore 
it is always possible by a wise selection of factors and levels to keep 
the experiment within reasonable limits. In extreme cases where a large 
number of combinations seem to be called for, the device known as 
fractional replication can be utilized and this also will be discussed 
briefly. 
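How quickly the number of treatment combinations grows can be seen by enumerating them. The helper below is a hypothetical illustration; levels are simply numbered 1, 2, 3, and so on within each factor.

```python
from itertools import product


def treatment_combinations(levels_per_factor):
    """List every combination of levels, one tuple per treatment."""
    return list(product(*(range(1, n + 1) for n in levels_per_factor)))


for n_factors in (4, 5):
    combos = treatment_combinations([3] * n_factors)
    print(n_factors, "factors at 3 levels:", len(combos), "combinations")

# The 6 varieties at 3 rates of seeding mentioned earlier in this section.
print(len(treatment_combinations([6, 3])))
```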

2. Design and analysis of a simple factorial experiment. A simple 
example of a factorial experiment would be one with 6 varieties, each at 3 
rates of seeding, making a total of 18 treatment combinations. It will 
make the general principle somewhat clearer to imagine the factors and 
levels outlined in a 6 x 3 table as follows, where the rates refer to pounds 
per acre. 

                                           Varieties (B) 
Rates of seeding (A)   30   $a_1b_1$  $a_1b_2$  $a_1b_3$  $a_1b_4$  $a_1b_5$  $a_1b_6$ 
                       60   $a_2b_1$  $a_2b_2$  $a_2b_3$  $a_2b_4$  $a_2b_5$  $a_2b_6$ 
                       90   $a_3b_1$  $a_3b_2$  $a_3b_3$  $a_3b_4$  $a_3b_5$  $a_3b_6$ 

In this table the letters and subscripts are in accord with the conventional 
notation for factorial experimentation in which the main factors are 
referred to as A, B, C, etc., and the small letters indicate the actual 
combinations, or the yield results. The subscripts indicate the various 
levels. Since the main effects are represented by A and B, the interaction 
will be AB. 

The table indicates immediately that the 17 degrees of freedom for 
treatments can be divided into 


A 2 DF 
B 5 DF 
AB 10 DF 


and from our knowledge of the methods of analysis of two-way tables 
there is no difficulty in dividing up the sums of squares similarly. 
The procedure in setting up this experiment, if we regard both factors 
of equal importance, will be to select a randomized block design and 
consider the 18 combinations as individual treatments. In other words 
each set of 18 will be randomized over an entire block. 

The analysis of the results will be according to the following outline, 
assuming 3 replications. 


                            DF 

Replicates                   2 
Treatments (17 DF)    A      2 
                      B      5 
                      AB    10 
Error                       34 


A variant of the analysis that can be introduced here if necessary is to 
split up the degrees of freedom and sums of squares for A into linear and 
quadratic components (see Chapter 10, Section 1). This would ordinarily 
be done when the increase in yield with rate of seeding appears to be 
linear, the object being to make an actual estimate of the linear increase 
in yield for each additional 30 pounds in the rate of seeding. 

3. Split-plot experiments. There are two conditions under which it is 
not satisfactory to randomize all combinations over an entire block. The 
first is when one of the factors is such that it cannot conveniently be 
applied to individual units. In plant disease work, for example, it may 
be necessary to establish different soil organisms that are to be tested for 
their effect on crops. These organisms cannot be established and kept 
separate on individual plots, and the best that can be done is to put them 
on reasonably convenient strips or blocks, which, however, are much too 
large for single plots. Assuming, therefore, an experiment with 4 soil 
organisms and 4 varieties, the plan would have to be set up somewhat as 
follows, where the organisms are indicated as levels of A and the varieties 
as levels of B. The plan shows 1 replicate only. 

PLAN OF SPLIT-PLOT EXPERIMENT 

$a_1b_1$  $a_1b_2$  $a_1b_3$  $a_1b_4$  |  $a_2b_1$  $a_2b_2$  $a_2b_3$  $a_2b_4$ 
$a_3b_1$  $a_3b_2$  $a_3b_3$  $a_3b_4$  |  $a_4b_1$  $a_4b_2$  $a_4b_3$  $a_4b_4$ 

In the actual plan the levels of B would of course be randomized within 
the 4-plot blocks, and the levels of A among the 4-plot blocks. 

An experiment of this sort is referred to as a split-plot experiment. 
Note that there are 2 plot sizes: those that contain levels of A, referred 
to as the main plots, and those within the main plots containing the 
levels of B, referred to as the sub-plots. Any difficulties in arriving 
at the correct form of the analysis of split-plot experiments are due 
usually to a failure to grasp the principle of the presence of distinct 
plot sizes. 

To arrive at the method of analysis let r = number of replicates, 
p = number of main plots per replicate, and q = number of sub-plots per 
main plot. Here we have p = 4, q = 4, and we shall take r = 3. The 
A levels are being tested in a simple randomized block in which r = 3; 
therefore the analysis for A will take the form 

              DF 
Replicates     2 
A              3 
Error (a)      6 


We note next that all B comparisons are within main plots; therefore we 
are concerned only with degrees of freedom within the main plots, making 
a total of 3 x 12 = 36 in all. Out of the 36 we have 3 for the levels of 
B and 9 for the interaction AB. It should be clear that this is the correct 
place for the interaction as it arises from differences between levels of B 
for given levels of A. The analysis for sub-plots is therefore 


DF 
B 3 
AB 9 
Error (b) 24 


Note that the degrees of freedom for error (b) can be obtained by 
subtraction, or from (3 + 9) x 2, where 3 + 9 comes from the treatment 
degrees of freedom for B and AB, and 2 from the degrees of freedom for 
replicates. In general, therefore, the complete analysis is 

              DF 
Replicates    r - 1 
A             p - 1 
Error (a)     (r - 1)(p - 1) 
B             q - 1 
AB            (p - 1)(q - 1) 
Error (b)     p(q - 1)(r - 1) 
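The general outline translates directly into a small function, sketched here with a name of our own choosing. The internal check confirms that the six lines always add up to one less than the total number of sub-plots.

```python
def split_plot_df(r, p, q):
    """Degrees of freedom for r replicates, p main plots, q sub-plots."""
    df = {
        "Replicates": r - 1,
        "A": p - 1,
        "Error (a)": (r - 1) * (p - 1),
        "B": q - 1,
        "AB": (p - 1) * (q - 1),
        "Error (b)": p * (q - 1) * (r - 1),
    }
    # The parts must add up to one less than the number of sub-plots.
    assert sum(df.values()) == r * p * q - 1
    return df


for source, value in split_plot_df(r=3, p=4, q=4).items():
    print(f"{source:<12}{value:>3}")
```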


To obtain sums of squares a straightforward procedure can be followed. 
The one exception is that it is usual to put all sums of squares on a 
sub-plot basis. Thus, if $R_i$ represents a replicate total, the replicate 
sum of squares is given by 

$$\frac{\sum R_i^2}{pq} - \frac{(\sum X)^2}{rpq}$$

where $\sum X$ is the grand total of all sub-plot yields. Also, if $A_i$ 
represents a main-plot treatment total, the sum of squares for A is 

$$\frac{\sum A_i^2}{rq} - \frac{(\sum X)^2}{rpq}$$

Further details of the analytical procedure can be obtained from Example 
12-1. 

The second condition wherein the split-plot experiment is the most 
satisfactory is when one or more factors are included merely to increase 
the scope of the experiment and where results on the main effects of this 
factor are not required. In plant disease work, when a series of seed 
treatments are to be tested, it is customary to include more than one 
variety in the test or perhaps more than one cultural practice. The 
object is not to obtain information on the mean yields of the varieties 
but to observe if the treatments behave alike on all varieties. The logical 
procedure is to sacrifice accuracy on the variety comparisons by putting 
them in the main plots, as thereby an increase in accuracy is obtained for 
the seed treatments and for the interaction of these treatments with the 
varieties. 

A general picture with respect to the efficiency of split-plot experiments 
can be obtained by noting that the design does not result in an average 
increase in accuracy over randomized blocks. The randomized block 
error would in fact be given by summing the sums of squares and degrees 
of freedom for error (a) and error (b). The effect is to split the 
randomized block error into two parts, error (a), which is larger than the 
randomized block error on the average, and error (b), which is smaller. 
It is simply a question of sacrificing precision on one set of comparisons 
in order to obtain increased precision on another set. 

4. Standard errors in split-plot experiments. There is a slight complication 
in the comparison of means from a split-plot experiment which is 
interesting not only from the practical but the theoretical standpoint. In 
the type of experiment we have outlined in Section 3 there are four 
different types of comparisons that one might wish to make. These are 
as follows, together with formulas given by Cochran and Cox [2] for the 
standard errors of a mean difference. $E_a$ refers to the mean square for 
error (a) and $E_b$ to the mean square for error (b). 

1. Between 2 means of A over all levels of B, such as $\bar{a}_1 - \bar{a}_2$: 

$$\sqrt{\frac{2E_a}{rq}}$$

2. Between 2 means of B over all levels of A, such as $\bar{b}_1 - \bar{b}_2$: 

$$\sqrt{\frac{2E_b}{rp}}$$

3. Between 2 means of B for 1 level of A, such as $a_1b_1 - a_1b_2$: 

$$\sqrt{\frac{2E_b}{r}}$$

4. Between 2 means of A for 1 level of B, such as $a_1b_1 - a_2b_1$: 

$$\sqrt{\frac{2[(q - 1)E_b + E_a]}{rq}}$$

The first three comparisons are straightforward and are merely variations 
of the usual formula with all results expressed on a single-plot basis. 
The fourth comparison involves the 2 errors of the split-plot experiment 
and is a weighted mean of $E_a$ and $E_b$. These 2 errors are admittedly 
different, and we are faced with the problem of making a test of significance 
of a difference between 2 means drawn from populations having a different 
variance. This problem has been dealt with by Behrens [1] and Fisher 
[5], and tables at the 5% level have been prepared by Sukhatme [8]. For 
our purpose an approximation to the exact test suggested by Cochran and 
Cox [2] is sufficient. 

Since $E_a$ and $E_b$ are based usually on different degrees of freedom, 
$t_{0.05}$ for each case will be different. The approximation involves 
calculating a t as follows. 

$$t = \frac{(q - 1)E_b t_b + E_a t_a}{(q - 1)E_b + E_a}$$

where $t_b = t_{0.05}$ for $E_b$, and $t_a = t_{0.05}$ for $E_a$. 
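The four standard errors and the approximate t can be gathered into two functions. The names and dictionary labels are assumptions of this sketch; the numerical inputs are the mean squares, degrees-of-freedom points and dimensions of the barley experiment of Example 12-1 (8 replicates, 6 main plots, 2 sub-plots).

```python
import math


def split_plot_standard_errors(e_a, e_b, r, p, q):
    """Standard errors of a mean difference for the four comparisons."""
    return {
        "A means over all B": math.sqrt(2 * e_a / (r * q)),
        "B means over all A": math.sqrt(2 * e_b / (r * p)),
        "B means within one A": math.sqrt(2 * e_b / r),
        "A means within one B": math.sqrt(2 * ((q - 1) * e_b + e_a) / (r * q)),
    }


def approximate_t(e_a, e_b, t_a, t_b, q):
    """Weighted 5% point of t for comparing A means within one level of B."""
    weight_b = (q - 1) * e_b
    return (weight_b * t_b + e_a * t_a) / (weight_b + e_a)


errors = split_plot_standard_errors(e_a=34.89, e_b=29.84, r=8, p=6, q=2)
for comparison, se in errors.items():
    print(f"{comparison:<22}{se:6.2f}")
print(round(approximate_t(34.89, 29.84, t_a=2.03, t_b=2.02, q=2), 2))
```

Because the weights are the same quantities that enter the fourth standard error, the approximate t always lies between the two tabulated values.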

5. Example 12-1. A simple split-plot experiment. Table 12-1 gives 
the yields obtained in an experiment with barley on the control of disease 
by dusting the seed before planting with mercury dust. Six classes of 
seed, based on the degree of infection, were included, the infection classes 
being arbitrarily designated 1, 2, 3, ..., 6. Each plot was divided, one 
half receiving dusted seed and the other half receiving undusted seed. The 
yields are reported in bushels per acre. 


TABLE 12-1 

YIELDS OF BARLEY IN A SPLIT-PLOT EXPERIMENT 

Infection  Treat-                          Replicate                           Treatment  Infection 
class      ment       1      2      3      4      5      6      7      8         total      total 

1          D        52.6   53.8   54.4   57.6   63.6   57.6   53.5   44.2       437.3 
           U        53.4   64.8   63.2   55.8   50.1   59.7   49.4   56.4       452.8      890.1 

2          D        57.0   64.8   55.4   55.4   54.0   61.4   54.0   53.0       455.0 
           U        54.2   56.2   54.0   58.8   58.0   43.8   50.1   53.8       428.9      883.9 

3          D        58.9   58.8   58.2   56.2   54.2   56.1   56.4   54.0       452.8 
           U        54.2   49.8   54.0   45.6   57.6   63.9   59.4   63.2       447.7      900.5 

4          D        50.4   63.4   58.2   63.0   63.6   60.0   55.2   50.1       463.9 
           U        48.0   57.4   54.0   57.4   50.4   45.6   48.0   50.1       410.9      874.8 

5          D        49.0   64.8   62.2   57.0   68.4   65.4   64.8   57.0       488.6 
           U        39.2   40.2   51.0   49.8   48.2   64.8   53.4   52.8       399.4      888.0 

6          D        55.4   52.8   51.4   63.0   60.2   45.6   61.2   58.8       448.4 
           U        33.4   38.2   40.2   57.0   49.0   55.4   44.6   47.0       364.8      813.2 

Replicate total    605.7  665.0  656.2  676.6  677.3  679.3  650.0  640.4                 5250.5 


The first step is to outline the analysis. This is given below. 

                         DF 
Replicates                7 
Infection classes         5 
Error (a)                35 
Dusting                   1 
Dusting x Infections      5 
Error (b)                42 
Total                    95 


To complete the first set of calculations based on the main plots we 
require the main-plot totals, and, for the remainder, the differences for 
the dusted and undusted pairs making up each main plot. These are 
given in Table 12-2. Either one of these sets of values can be omitted 
and the required sum of squares obtained by subtraction, but the method 
illustrated provides a complete check on the calculations. 


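TABLE 12-2 

MAIN-PLOT TOTALS AND CORRESPONDING DIFFERENCES BETWEEN 
DUSTED AND UNDUSTED PLOTS 

(For each infection class the first line gives the main-plot totals and the 
second line the differences, dusted minus undusted.) 

Infection                               Replicate 
class        1       2       3       4       5       6       7       8      Total 

1          106.0   118.6   117.6   113.4   113.7   117.3   102.9   100.6    890.1 
            -0.8   -11.0    -8.8     1.8    13.5    -2.1     4.1   -12.2    -15.5 

2          111.2   121.0   109.4   114.2   112.0   105.2   104.1   106.8    883.9 
             2.8     8.6     1.4    -3.4    -4.0    17.6     3.9    -0.8     26.1 

3          113.1   108.6   112.2   101.8   111.8   120.0   115.8   117.2    900.5 
             4.7     9.0     4.2    10.6    -3.4    -7.8    -3.0    -9.2      5.1 

4           98.4   120.8   112.2   120.4   114.0   105.6   103.2   100.2    874.8 
             2.4     6.0     4.2     5.6    13.2    14.4     7.2     0.0     53.0 

5           88.2   105.0   113.2   106.8   116.6   130.2   118.2   109.8    888.0 
             9.8    24.6    11.2     7.2    20.2     0.6    11.4     4.2     89.2 

6           88.8    91.0    91.6   120.0   109.2   101.0   105.8   105.8    813.2 
            22.0    14.6    11.2     6.0    11.2    -9.8    16.6    11.8     83.6 

Grand total                                                                5250.5 
Difference total                                                            241.5 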

It is a good plan to determine the total sum of squares and divide this 
into between and within main plots by direct calculation. Thus, 


$$\text{Total} = 291{,}475.77 - \frac{(\sum X)^2}{96} = 291{,}475.77 - 287{,}164.07 = 4311.70$$

$$\text{Main plots} = \frac{\sum (\text{main-plot total})^2}{2} - 287{,}164.07 = 1889.57$$

$$\text{Within main plots} = \frac{4844.25}{2} = 2422.12$$

Then 1889.57 + 2422.12 = 4311.69, giving a check on the calculations. 
The next step is to divide up the main-plot sum of squares according to 
replicates, infection classes, and error. We have 


$$\text{Replicates} = \frac{3{,}450{,}279.43}{12} - 287{,}164.07 = 359.22$$

$$\text{Infection classes} = \frac{4{,}599{,}570.75}{16} - 287{,}164.07 = 309.10$$

Then the main-plot error = 1889.57 - 359.22 - 309.10 = 1221.25. 

In order to divide up the sum of squares for within main plots we 
calculate 
$$\text{Interaction} = \frac{15.5^2 + 26.1^2 + \cdots + 83.6^2}{16} - \frac{241.5^2}{96} = 1168.88 - 607.52 = 561.36$$

$$\text{D vs. U} = \frac{241.5^2}{96} = 607.52$$

$$\text{Error} = 2422.12 - 561.36 - 607.52 = 1253.24$$


The complete analysis can now be set up. 


                      SS       DF     MS        F      5% Point 
Replicates           359.22     7    51.32 
Infection classes    309.10     5    61.82     1.77     2.49 
Error (a)           1221.25    35    34.89 
D vs. U              607.52     1   607.52    20.36     4.07 
Interaction          561.36     5   112.27     3.76     2.44 
Error (b)           1253.24    42    29.84 
Total               4311.69    95 
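The whole analysis can be retraced from the yields of Table 12-1. The script below is a sketch with names chosen here; it follows the method of main-plot totals and dusted-minus-undusted differences, which depends on there being exactly 2 sub-plots per main plot.

```python
# Table 12-1: for each infection class, dusted (D) and undusted (U) yields
# in replicates 1 to 8.
dusted = [
    [52.6, 53.8, 54.4, 57.6, 63.6, 57.6, 53.5, 44.2],
    [57.0, 64.8, 55.4, 55.4, 54.0, 61.4, 54.0, 53.0],
    [58.9, 58.8, 58.2, 56.2, 54.2, 56.1, 56.4, 54.0],
    [50.4, 63.4, 58.2, 63.0, 63.6, 60.0, 55.2, 50.1],
    [49.0, 64.8, 62.2, 57.0, 68.4, 65.4, 64.8, 57.0],
    [55.4, 52.8, 51.4, 63.0, 60.2, 45.6, 61.2, 58.8],
]
undusted = [
    [53.4, 64.8, 63.2, 55.8, 50.1, 59.7, 49.4, 56.4],
    [54.2, 56.2, 54.0, 58.8, 58.0, 43.8, 50.1, 53.8],
    [54.2, 49.8, 54.0, 45.6, 57.6, 63.9, 59.4, 63.2],
    [48.0, 57.4, 54.0, 57.4, 50.4, 45.6, 48.0, 50.1],
    [39.2, 40.2, 51.0, 49.8, 48.2, 64.8, 53.4, 52.8],
    [33.4, 38.2, 40.2, 57.0, 49.0, 55.4, 44.6, 47.0],
]
p, r, q = len(dusted), len(dusted[0]), 2  # classes, replicates, sub-plots

# Main-plot totals and dusted-minus-undusted differences (Table 12-2).
totals = [[d + u for d, u in zip(dr, ur)] for dr, ur in zip(dusted, undusted)]
diffs = [[d - u for d, u in zip(dr, ur)] for dr, ur in zip(dusted, undusted)]

grand = sum(map(sum, totals))
correction = grand ** 2 / (r * p * q)


def sq(values):
    """Sum of the squares of the values."""
    return sum(v * v for v in values)


ss = {}
ss["Replicates"] = sq(sum(col) for col in zip(*totals)) / (p * q) - correction
ss["Infection classes"] = sq(sum(row) for row in totals) / (r * q) - correction
main_plots = sq(t for row in totals for t in row) / q - correction
ss["Error (a)"] = main_plots - ss["Replicates"] - ss["Infection classes"]

# With 2 sub-plots, half the squared difference is the within-plot part.
within = sq(d for row in diffs for d in row) / q
ss["D vs. U"] = sum(map(sum, diffs)) ** 2 / (r * p * q)
ss["Interaction"] = sq(sum(row) for row in diffs) / (r * q) - ss["D vs. U"]
ss["Error (b)"] = within - ss["D vs. U"] - ss["Interaction"]

df = {"Replicates": r - 1, "Infection classes": p - 1,
      "Error (a)": (r - 1) * (p - 1), "D vs. U": q - 1,
      "Interaction": (p - 1) * (q - 1), "Error (b)": p * (q - 1) * (r - 1)}
for name in ss:
    print(f"{name:<18}{ss[name]:9.2f}{df[name]:4d}{ss[name] / df[name]:9.2f}")
```

Small differences in the second decimal from the hand calculation can arise because the script carries the correction term unrounded.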


The means for the infection classes are not significantly different, but this 
does not mean that the experiment has not supplied the information 
required. The objective of the experiment was to compare the treated 
and untreated plots and to find out if the effect of treating was dependent 
on the extent of the infection on the seed. The latter point is brought 
out by the significance of the interaction. It is indicated in the data by 
the trend of the differences shown in the last column of Table 12-2. 
Standard errors of mean differences can be worked out according to 
the formulas of Section 4. They are illustrated below for the 4 types of 
comparisons that may be needed. 


1. Infection class 4 with 6. Mean difference = 54.68 - 50.82 = 3.86. 

$$\text{SE (mean difference)} = \sqrt{\frac{2 \times 34.89}{16}} = 2.09$$

$$t = \frac{3.86}{2.09} = 1.85$$

5% point for 35 degrees of freedom = 2.03 

2. Dusted and undusted. Mean difference = 57.21 - 52.18 = 5.03. 

$$\text{SE (mean difference)} = \sqrt{\frac{2 \times 29.84}{48}} = 1.12$$

$$t = \frac{5.03}{1.12} = 4.49$$

5% point for 42 degrees of freedom = 2.02 

3. Dusted and undusted for infection class 4. Mean difference = 6.62. 

$$\text{SE (mean difference)} = \sqrt{\frac{2 \times 29.84}{8}} = 2.73$$

$$t = \frac{6.62}{2.73} = 2.42$$

5% point for 42 degrees of freedom = 2.02 

4. Infection class 2 with infection class 6 for the undusted plots. Mean 
difference = 53.61 - 45.60 = 8.01. 

$$\text{SE (mean difference)} = \sqrt{\frac{2(34.89 + 29.84)}{16}} = 2.84$$

$$t = \frac{8.01}{2.84} = 2.82$$

Since $t_{0.05}$ for 35 degrees of freedom = 2.03 and for 42 degrees of 
freedom is 2.02, any difficulty due to the standard error being a weighted 
mean is not of much importance here. However, the method of finding 
an approximate t at the 5% level recommended by Cochran and Cox [2] 
is given as an illustration. 

$$t = \frac{29.84 \times 2.02 + 34.89 \times 2.03}{29.84 + 34.89} = 2.03$$

6. Confounding in factorial experiments. It has already been pointed 
out in Chapter 11, Section 3, that if a large number of treatments are 
tested in a randomized block experiment the design is not likely to be 
efficient. In field plot experiments this is a reflection of the fact that 
plots that are close together tend to yield alike. Thus the maximum 
control over the error would in most cases be obtained by having only 2 
long narrow plots placed side-by-side in each replicate. As the size of 
the block increases, variations between plots at some distance from each 
other are not only uncorrelated but major changes such as definite 
fertility trends occur within the block. A point is reached in increasing 
the number of plots in the replicate, where error control is practically 
non-existent. 

In factorial experiments the device known as confounding is introduced 
in order to reduce the size of the blocks and obtain greater error control. 
Confounding involves the selection of certain comparisons deemed 
relatively unimportant and arranging the treatment combinations in such 
a way that these unimportant comparisons cannot be separated from the 
block effects, the blocks being subdivisions of a complete replicate. All 
other comparisons are made within blocks, and consequently the error to 
which they are subject is less than if they were distributed at random over 
the entire replicate. The comparisons entangled with block effects are 
said to be confounded with blocks. Although this would appear at first 
to involve a definite loss of information on certain comparisons, we shall 
find that by a process known as partial confounding it is frequently 
possible to increase the over-all accuracy to such an extent that even the 
partially confounded comparisons are determined more accurately than 
in a non-confounded design. 

There are now a great many designs for confounding factorial 
experiments, and it is therefore impossible to describe them here.* We are 
concerned chiefly with making clear the principles on which such 
experiments are based, and this can best be done by explaining the simpler 
designs in some detail. Detailed descriptions and layouts of the various 
types of confounded designs can be found in Fisher [6], Yates [11], and 
Cochran and Cox [2]. 

* The split-plot design discussed above actually involves confounding in that the 
main-plot treatments may be thought of as being confounded with main plots. 

7. Complete confounding in a $2^3$ experiment. A $2^3$ experiment is one in 
which there are 3 factors, each at 2 levels. Following the conventional 
notation of Section 2, the factors can be represented by A, B, and C, and 
if the first level of each factor is zero the combinations are (1), a, b, c, ab, 
ac, bc, and abc, where the symbol (1) represents the zero level of all 3 
factors at once. Thus in detail 

(1) = $a_0b_0c_0$ 
b = $a_0b_1c_0$ 
ab = $a_1b_1c_0$ 
etc. 


It will be realized that this symbolism will apply equally well even if the 
first level of any of the factors is not a zero level. Comparable to the 
treatment combinations there are the effects of the combinations, 
represented by A, B, C for the main effects, AB, AC, BC for the 2-factor 
interactions, and ABC for the 3-factor interaction. The 7 degrees of 
freedom for the $2^3 = 8$ treatment effects are 

        DF 
A        1 
B        1 
C        1 
AB       1 
AC       1 
BC       1 
ABC      1 

If the treatment combinations were randomized within each replicate of 
8 plots, with 4 complete replications, the analysis of variance would be of 
the form 


DF 
Replicates 3 
Main effects 3 
2-factor interactions 3 
3-factor interaction 1 
Error 21 


As a basis for introducing confounding into this experiment it is 
necessary to decide on the treatment combinations that enter into the 
different comparisons. The main effects are obviously 


A = [a + ab + ac + abc] - [(1) + b + c + bc] 
B = [b + ab + bc + abc] - [(1) + a + c + ac] 
C = [c + ac + bc + abc] - [(1) + a + b + ab] 

these equations being purely formal and not representing actual totals or 
means wherein summations and divisors would be given. For interactions, 
as well as for main effects, a rule can be given for writing down 
the combinations directly, but it is preferable at first to see how this can 
be done from first principles. Thus the interaction AB will come from 
the difference between the diagonal totals of the 2 x 2 table. 

               B absent     B present 
A absent       (1) + c      b + bc 
A present      a + ac       ab + abc 

Therefore the required contrast AB is 

AB = [(1) + c + ab + abc] - [a + ac + b + bc] 

and in a similar manner 

AC = [(1) + b + ac + abc] - [a + c + ab + bc] 
BC = [(1) + a + bc + abc] - [b + c + ab + ac] 

The 3-factor interaction can be derived from the two 2 x 2 tables: 

                    c absent                     c present 
               B absent   B present         B absent   B present 
A absent         (1)         b                 c          bc 
A present         a          ab                ac         abc 

which show that 

ABC = [a + b + c + abc] - [(1) + ab + ac + bc] 
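The contrasts written out above follow a simple sign pattern, which the sketch below makes explicit. The function and constant names are hypothetical; the empty string stands for the combination (1).

```python
from itertools import combinations

FACTORS = "abc"

# The 8 treatment combinations in standard notation; "" stands for (1).
COMBOS = ["".join(c) for k in range(len(FACTORS) + 1)
          for c in combinations(FACTORS, k)]


def contrast(effect):
    """Split the combinations into the + and - groups of an effect.

    effect is a string such as "A", "AB" or "ABC". A combination gets a
    factor of +1 for each effect letter it contains and -1 for each it
    lacks; the product of these factors is its sign in the contrast.
    """
    plus, minus = [], []
    for combo in COMBOS:
        sign = 1
        for letter in effect.lower():
            sign *= 1 if letter in combo else -1
        (plus if sign > 0 else minus).append(combo or "(1)")
    return plus, minus


for effect in ("A", "AB", "ABC"):
    plus, minus = contrast(effect)
    print(f"{effect} = [{' + '.join(plus)}] - [{' + '.join(minus)}]")
```

The groups come out in a fixed listing order, so the terms may appear in a different order from the hand-written expressions while containing the same combinations.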


The simplest method for practical purposes is to follow the rule given 
by Finney [3] which enables us to write down the components of the 
required comparisons directly. First we define the block or group of 
combinations containing the treatment (1) as the principal block. Then 
for a 3-factor interaction such as ABC, referred to as the defining contrast, 
we write down for the principal block, (1), plus all combinations 
containing 2 of the letters in ABC. This gives directly (1), ab, ac, bc. The 
remaining group can be obtained by writing all combinations not in the 
principal block or by a process of multiplication involving a slight 
modification of the rules of algebra. We say (1) x a = a, a x c = ac, in the 
conventional manner, but a x ab = b, ac x c = a, etc., and the rule is 
that if a letter occurs as a square it is struck out. Now if we take the 
principal block and multiply by any combination not in the block, the 
second block is generated thus 


[(1) + ab + ac + bc] a = a + b + c + abc 


or 


[(1) + ab + ac + bc] b = b + a + abc + c 


To obtain the combinations for a 2-factor interaction such as AB, 
where a third factor C is involved, it is simply necessary to write the 
principal block with C absent and C present. It is convenient to write 
AB(C) for such an interaction, giving 


AB(C) = (1) + ab + [(1) + ab] c = (1) + ab + c + abc 


As a check on the method we can apply it in writing the combinations 
for a main effect. For example, the principal block for A(BC) is 


(1) + (1)[b + c + bc] = (1) + b + c + bc 

and the remaining block is a + ab + ac + abc. In the bracket in the 
expression on the left, the letters b, c, bc are simply all combinations of 
the letters bc, one and two at a time. It will be shown later how this 
rule can be generalized for the design and analysis of $2^n$ experiments 
where its chief value lies. 
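The rule and the modified multiplication can be sketched with sets of letters. Two points are assumptions of this sketch rather than statements from the text: the function names, and the membership test for the principal block, which is phrased here as "an even number of letters in common with the defining contrast"; it reproduces the three principal blocks written out above.

```python
from itertools import combinations


def all_combinations(factors):
    """Every treatment combination as a frozenset of letters; (1) is empty."""
    return [frozenset(c) for k in range(len(factors) + 1)
            for c in combinations(factors, k)]


def multiply(x, y):
    """Product of two combinations: a squared letter is struck out."""
    return x ^ y  # symmetric difference of the two letter sets


def label(combo):
    return "".join(sorted(combo)) or "(1)"


def blocks(defining_contrast, factors="abc"):
    """Principal block and remaining block for one defining contrast."""
    contrast = set(defining_contrast.lower())
    combos = all_combinations(factors)
    # Assumption: membership is tested by an even number of shared letters,
    # which reproduces the blocks written out in the text.
    principal = [c for c in combos if len(c & contrast) % 2 == 0]
    outside = next(c for c in combos if c not in principal)
    other = [multiply(c, outside) for c in principal]
    return [label(c) for c in principal], [label(c) for c in other]


for contrast in ("ABC", "AB", "A"):
    print(contrast, *blocks(contrast))
```

Multiplying the principal block by any combination outside it gives the same second block, only in a different order, which is why the sketch simply takes the first such combination.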

In order to avoid confusion in writing expressions such as those given 
above, it is well to have a rule for signs and some method of indicating 
whether the expression is a purely formal one in order to decide on the 
combinations involved in a given comparison, or whether it refers to a 
total or mean effect determined from actual numbers. For the total 
effect of A from an actual experiment we can take 


$$A_t = \{[(1) + b + c + bc] - [a + ab + ac + abc]\}(-1)^1$$

where the exponent of (-1) represents a main effect and ensures that the 
sign will be correct. In this formula it is assumed that the values within 
the brackets are summed over all replicates. To abbreviate we can put 

$$A_p = (1) + b + c + bc$$

$$A_q = a + ab + ac + abc$$

Therefore $A_p$ will represent summation over all replicates for the 
combinations in which a is absent, or from a different standpoint over all 
combinations falling in the principal block if A were confounded. Finally 

$$A_t = (A_p - A_q)(-1)^1$$

and the mean effect of A is 

$$\frac{1}{4r}(A_p - A_q)(-1)^1$$

where the 4 in the divisor arises from the fact that there are 3 factors; 
i.e., for n factors the divisor will be $2^{n-1}r$. Following the same notation 

$$AB = \frac{1}{4r}(AB_p - AB_q)(-1)^2$$

$$ABC = \frac{1}{4r}(ABC_p - ABC_q)(-1)^3$$


Returning now to the procedure of confounding, the least important 
comparison is the 3-factor interaction ABC, and, if one comparison is to 
be completely confounded with blocks, this is the logical selection.* In 
each replicate therefore we will have 2 blocks as follows: 


Block 1 Block 2 
Replicate | (1) ab ac bc | a b c abc | 


and it is clear that with this layout the 3-factor interaction will be com- 
pletely confounded with blocks. In other words from such an experiment 
ABC will be in part due to the interaction and in part to the blocks. 

If we have 4 similar replicates, the outline of the analysis will be 


DF 

Blocks 7 
Main effects 3 
2-factor interactions 3 
Error 18 
Total 31 


As compared to a randomized block plan there are 4 more degrees of
freedom for error control, one of which represents the interaction ABC;
therefore the error degrees of freedom is reduced by 3. It is important
to note that the 3-factor interaction has not been completely sacrificed in
that the 7 degrees of freedom for blocks can be divided up into


                                DF
    Replicates                   3
    ABC                          1
    Error (between blocks)       3


and hence a very rough test is available. However, it is unlikely that the
test will be made as the effect of the interaction would ordinarily have to
be very large to reach significance.

* Obviously the objective of the experiment must be considered in selecting the
comparison to be confounded. The 3-factor interaction could have the most important
interaction effect and in that case should not be confounded.

Since the procedure adopted here is open to some criticism in that we 
have completely sacrificed one effect, it is of interest to inquire into another 
confounding procedure which enables us to distribute the confounding 
over more than one effect by confounding a different degree of freedom 
in each replicate. This is known as partial confounding. 

8. Partial confounding in a $2^3$ experiment. Partial confounding in a $2^3$
experiment can be obtained by confounding AB in the first replicate,
AC in the second, BC in the third, and ABC in the fourth. The blocks
will be:


                                                  Comparison Confounded

Replicate 1  | (1)  ab  c   abc |  a  b  ac  bc |          AB
Replicate 2  | (1)  ac  b   abc |  a  c  ab  bc |          AC
Replicate 3  | (1)  bc  a   abc |  b  c  ab  ac |          BC
Replicate 4  | (1)  ab  ac  bc  |  a  b  c   abc |         ABC


We can now speak of an interaction such as AB being partially con- 
founded because it is confounded in only 1 replicate out of 4. The 
analysis of this experiment can be outlined as follows.


                DF
    Blocks       7
    A            1
    B            1
    C            1
    AB           1   |
    AC           1   |  Precision factor = 3/4
    BC           1   |
    ABC          1   |
    Error       17
    Total       31


The precision of the interactions is 3/4 because they are tested in only 3/4
of the replicates. The method of analysis is to determine the sums of
squares for the blocks and the main effects in the usual way, and the
interaction sums of squares from those blocks in which they are not
confounded. The details are illustrated in Example 12-3.

9. Adjustments to means in confounded $2^3$ experiments. In order to see
why adjustments should sometimes be required in means taken from
confounded experiments, suppose that we have 2 blocks in which the
yields are represented as follows. On the left we have the various
treatment combinations. In the center these are represented by algebraic
symbols; and on the right by figures.


      p      q            p                     q                 p        q

     (1)     a       $x + \alpha_1$        $y + \beta_1$        4 + 2    6 + 3
     ab      b       $x + \alpha_2$        $y + \beta_2$        4 + 6    6 + 2
     ac      c       $x + \alpha_3$        $y + \beta_3$        4 + 3    6 + 8
     bc      abc     $x + \alpha_4$        $y + \beta_4$        4 + 5    6 + 7

     Mean            $x + \bar{\alpha}$    $y + \bar{\beta}$    4 + 4    6 + 5


The block effect is represented by x in block p and y in block q. Then $\alpha_1$
corresponds with the response to treatment (1), $\beta_1$ to the response to
treatment a, and so forth. If we wish to compare ab with b, it is clear
that the comparison we require is $\alpha_2 - \beta_2$ but what we have is
$(x + \alpha_2) - (y + \beta_2)$ or $(x - y) + (\alpha_2 - \beta_2)$ which contains the block effect.
Suppose then that we have some means of estimating $\bar{\alpha} - \bar{\beta}$, which for the
particular blocks we have set up is the interaction effect ABC as this is
the comparison that is confounded. This gives us


$D_m$ (difference of block means) $= (x - y) + (\bar{\alpha} - \bar{\beta}) = (x - y) + ABC$


Then $x - y = D_m - ABC$. The difference between the 2 means we wish
to compare is


$d_m = (x - y) + (\alpha_2 - \beta_2)$
$\therefore (\alpha_2 - \beta_2) = d_m - (x - y) = d_m - (D_m - ABC)$


and the term on the right is the required correction. To apply a correction
to an individual mean we can put $(x - y) = D_m - ABC$ in the form


$x - \frac{1}{2}(D_m - ABC) = y + \frac{1}{2}(D_m - ABC)$


Therefore the correction to means in block p is $-\frac{1}{2}(D_m - ABC)$ and in
block q it is $+\frac{1}{2}(D_m - ABC)$.
Applying this to the numerical example, we have $\bar{\alpha} - \bar{\beta} = 4 - 5 = -1$,
which in this case is known but normally estimated from other data.
Then


$-\frac{1}{2}[D_m - (\bar{\alpha} - \bar{\beta})] = -\frac{1}{2}[(8 - 11) - (-1)] = +1$
and
$+\frac{1}{2}[D_m - (\bar{\alpha} - \bar{\beta})] = -1$


The corrections applied to the means of ab and b give 10 + 1 = 11, and
8 - 1 = 7. The difference is 11 - 7 = 4, and this corresponds as it
should to $\alpha_2 - \beta_2 = 6 - 2 = 4$. In applying this principle to actual data,
it is only necessary to make sure that all calculations are on a single-plot 
basis. The procedure is illustrated in Examples 12-2 and 12-3. 
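
The arithmetic of the numerical illustration can be retraced with a short sketch. The variable names are assumptions made for the sketch; the figures are those of the small two-block table above, with block effects x = 4 and y = 6.

```python
x, y = 4, 6                  # block effects for blocks p and q
alpha = [2, 6, 3, 5]         # responses to (1), ab, ac, bc in block p
beta = [3, 2, 8, 7]          # responses to a, b, c, abc in block q

mean_p = x + sum(alpha) / 4  # observed mean of block p
mean_q = y + sum(beta) / 4   # observed mean of block q
D_m = mean_p - mean_q        # difference of block means

# Here the confounded effect is known exactly; normally it is estimated from other data.
ABC = sum(alpha) / 4 - sum(beta) / 4

correction_p = -0.5 * (D_m - ABC)
correction_q = +0.5 * (D_m - ABC)

ab_corrected = (x + alpha[1]) + correction_p
b_corrected = (y + beta[1]) + correction_q

print(correction_p, correction_q)
print(ab_corrected, b_corrected, ab_corrected - b_corrected)
```

After correction the difference between ab and b equals the difference of the two responses alone, the block effect having dropped out.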

10. Example 12-2. Complete confounding in a $2^3$ experiment. The
factors are A, B, C, and the experiment is conducted in 4 replicates. In
each replicate the interaction ABC is confounded. The results are as
shown in Table 12-3. An actual experiment of this type not being
available, the data are taken from uniformity trials by Sayer et al. [7].


TABLE 12-3

YIELD RESULTS FROM A $2^3$ EXPERIMENT


                  I_p            I_q            II_p           II_q

               (1)  19.1      a    18.6      (1)  20.7      a    25.9
               ab   19.2      b    18.2      ab   22.1      b    23.0
               ac   18.8      c    19.0      ac   21.2      c    24.9
               bc   19.4      abc  20.4      bc   20.1      abc  23.4

Block Total         76.5           76.2           84.1           97.2


                  III_p          III_q          IV_p           IV_q

               (1)  23.4      a    22.2      (1)  19.1      a    23.6
               ab   20.4      b    21.0      ab   21.9      b    23.7
               ac   23.2      c    23.6      ac   18.6      c    21.0
               bc   20.3      abc  21.6      bc   21.5      abc  22.8

Block Total         87.3           88.4           81.1           91.1
                                                            G = 681.9


                   (1)     a       b       c       ab      ac      bc      abc
Treatment Total    82.3    90.3    85.9    88.5    83.6    81.8    81.3    88.2
Treatment Mean     20.58   22.58   21.48   22.12   20.90   20.45   20.32   22.05

The calculation of sums of squares proceeds as follows.


Total    $14,658.870 - G^2/32 = 14,658.870 - 14,530.863 = 128.007$

Blocks   $58,491.61/4 - 14,530.863 = 14,622.902 - 14,530.863 = 92.039$

A        $(343.9 - 338.0)^2/32 = 5.9^2/32 = 1.088$
B        $(339.0 - 342.9)^2/32 = (-3.9)^2/32 = 0.475$
C        $(339.8 - 342.1)^2/32 = (-2.3)^2/32 = 0.165$

AB       $\{[(1) + ab + c + abc] - [a + b + ac + bc]\}^2/32$
         $(342.6 - 339.3)^2/32 = 3.3^2/32 = 0.340$

AC       $\{[(1) + ac + b + abc] - [a + c + ab + bc]\}^2/32$
         $(338.2 - 343.7)^2/32 = (-5.5)^2/32 = 0.945$

BC       $\{[(1) + bc + a + abc] - [b + c + ab + ac]\}^2/32$
         $(342.1 - 339.8)^2/32 = 2.3^2/32 = 0.165$


The complete analysis can now be set up.


              SS        DF      MS        F      5% Point

Blocks       92.039      7     13.15     7.22     2.58
A             1.088      1      1.088    0.60     4.41
B             0.475      1      0.475    0.26     4.41
C             0.165      1      0.165    0.09     4.41
AB            0.340      1      0.340    0.19     4.41
AC            0.945      1      0.945    0.52     4.41
BC            0.165      1      0.165    0.09     4.41
Error        32.790     18      1.822

Total       128.007     31


Note that the interaction ABC is not calculated since it is completely 
confounded with blocks. In this example none of the treatment effects 
is significant as would be expected for data taken from a uniformity trial. 
The dummy treatment effects appear actually to be abnormally low, an 
effect that seems to be due to some systematic characteristic of the data. 
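
The treatment sums of squares and mean effects of this example can be recomputed from the eight treatment totals. The sketch below is illustrative only; the helper names `sign` and `contrast` are hypothetical, and the sign rule used (a factor of +1 for each letter of the effect that is present in the combination, -1 for each that is absent) is the conventional one assumed here.

```python
totals = {"(1)": 82.3, "a": 90.3, "b": 85.9, "c": 88.5,
          "ab": 83.6, "ac": 81.8, "bc": 81.3, "abc": 88.2}


def sign(effect, treatment):
    """+1 or -1: product over the letters of the effect (present -> +1, absent -> -1)."""
    present = set(treatment.replace("(1)", ""))
    s = 1
    for letter in effect.lower():
        s *= 1 if letter in present else -1
    return s


def contrast(effect):
    """Total effect: signed sum of the treatment totals."""
    return sum(sign(effect, t) * y for t, y in totals.items())


n_plots = 32          # 8 treatments x 4 replicates
half_plots = 16       # each mean in the comparison is based on 16 plots

for effect in ["A", "B", "C", "AB", "AC", "BC"]:
    c = contrast(effect)
    print(effect, round(c, 1), round(c ** 2 / n_plots, 3), round(c / half_plots, 2))
```

ABC is deliberately left out of the loop, since in this design it cannot be separated from the block differences.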

Corrections to means for block effects cannot be made effectively in an
experiment of this sort unless we are prepared to assume that the
interaction ABC is negligible as it cannot be eliminated from the block effect.
Assuming that ABC is negligible, the means can be set up in 2 groups as
follows, corresponding to the arrangement in blocks for confounding
ABC.


                 (1)   20.58          a     22.58
                 ab    20.90          b     21.48
                 ac    20.45          c     22.12
                 bc    20.32          abc   22.05
    Group means        20.56                22.06
    Difference                   1.50
    Correction        + 0.75               - 0.75


The correction is simply 1/2 the difference between the 2 group means,
and the sign is taken such as to eliminate the block difference in a
comparison of any 2 means in different groups. Thus ab becomes 20.90 +
0.75 = 21.65, and b becomes 21.48 - 0.75 = 20.73.

It should be clear that, if there is a real ABC effect, this correction 
which assumes all the difference to be block effect would not be justified. 
Its application must be governed by judgment of a particular case. If 
the 3-factor interaction is likely to be small, as evidenced for example by 
a low value of the 2-factor interaction, and the block effect is appreciable, 
the correction should be made. It should also be clear that two-way 
tables showing 2-factor interaction effects do not require correction. The 
following table is an example. 


Corrections are not required here because the main effects A and B and 
the interaction AB are not confounded. 

Single expressions representing the mean responses for the main effects
and interactions are often useful. These are defined algebraically in
Section 7. Here we have


$A = \frac{343.9 - 338.0}{16} = +0.37$

$B = \frac{-3.9}{16} = -0.24$

$C = \frac{-2.3}{16} = -0.14$

$AB = \frac{3.3}{16} = +0.21$

$AC = \frac{-5.5}{16} = -0.34$

$BC = \frac{2.3}{16} = +0.14$


11. Example 12-3. Partial confounding in a $2^3$ experiment. In this
experiment the factors are A, B, C, and the data are given in Table 12-4.
Since an actual example was not available, Table 12-4 was set up from
uniformity data from Sayer et al. [7].


TABLE 12-4

DATA BY BLOCKS AND TREATMENTS FOR A PARTIALLY CONFOUNDED
$2^3$ EXPERIMENT


Effect confounded            AB                             AC
Block                 I_p          I_q              II_p          II_q

                   (1)  25.7    a    23.2        (1)  27.6    a    25.6
                   ab   21.1    b    21.0        ac   26.7    c    27.9
                   c    17.6    ac   18.6        b    26.2    ab   28.5
                   abc  17.5    bc   18.3        abc  22.0    bc   27.2

Block Total             81.9         81.1            102.5        109.2


Effect confounded            BC                             ABC
Block                 III_p        III_q            IV_p          IV_q

                   (1)  21.4    b    18.8        (1)  23.9    a    25.4
                   bc   18.6    c    16.0        ab   21.4    b    26.9
                   a    18.8    ab   16.4        ac   20.6    c    25.2
                   abc  18.2    ac   16.6        bc   22.4    abc  30.1

Block Total             77.0         67.8             88.3        107.6
                                                             G = 715.4


                   (1)     a       b       c       ab      ac      bc      abc
Treatment Total    98.6    93.0    92.9    86.7    87.4    82.5    86.5    87.8
Treatment Mean
(uncorrected)      24.65   23.25   23.22   21.68   21.85   20.62   21.62   21.95

The sums of squares for the total and for blocks are calculated as in
Example 12-2. For the treatment main effects the sum of squares can
be calculated directly from the treatment totals in Table 12-5, but for a
particular interaction we calculate the interaction sum of squares only
from those replicates in which the interaction is unconfounded. It is
convenient to prepare the totals as in Table 12-5.


TABLE 12-5

TREATMENT TOTALS FOR CALCULATION OF INTERACTIONS


For Calculation of           AB       AC       BC       ABC
                  All        Less     Less     Less     Less     Check
Treatment      Replicates    I        II       III      IV       Sum

(1)               98.6       72.9     71.0     77.2     74.7     295.8
a                 93.0       69.8     67.4     74.2     67.6     279.0
b                 92.9       71.9     66.7     74.1     66.0     278.7
c                 86.7       69.1     58.8     70.7     61.5     260.1
ab                87.4       66.3     58.9     71.0     66.0     262.2
ac                82.5       63.9     55.8     65.9     61.9     247.5
bc                86.5       68.2     59.3     67.9     64.1     259.5
abc               87.8       70.3     65.8     69.6     57.7     263.4

Grand Total      715.4      552.4    503.7    570.6    519.5
Block Total                 163.0    211.7    144.8    195.9
Sum                         715.4    715.4    715.4    715.4


In the preparation of Table 12-5, a double check can be obtained. In
the first row the total 295.8 = 98.6 x 3. Then the total of each column
plus the corresponding block total gives the grand total for the experiment.

We now calculate the sums of squares for treatment effects, keeping in
mind that the interactions are based on only 24 plots. At the same time
it is convenient to calculate the mean effects.


A      $(-14.0)^2/32 = 6.125$       $A = -14.0/16 = -0.875$
B      $(-6.2)^2/32 = 1.201$        $B = -6.2/16 = -0.388$
C      $(-28.4)^2/32 = 25.205$      $C = -28.4/16 = -1.775$
AB     $4.8^2/24 = 0.960$           $AB = 4.8/12 = 0.400$
AC     $(-14.9)^2/24 = 9.250$       $AC = -14.9/12 = -1.242$
BC     $7.2^2/24 = 2.160$           $BC = 7.2/12 = 0.600$
ABC    $(-13.9)^2/24 = 8.050$       $ABC = -13.9/12 = -1.158$

The complete analysis is given below.


              SS        DF      MS
Blocks      410.389      7     58.63
A             6.125      1      6.125
B             1.201      1      1.201
C            25.205      1     25.205
AB            0.960      1      0.960
AC            9.250      1      9.250
BC            2.160      1      2.160
ABC           8.050      1      8.050
Error        63.419     17      3.731


Total 526.759 31 
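
The sums of squares in this analysis can be retraced from the plot yields of Table 12-4. The following is a sketch under stated assumptions: the dictionary layout and the helper names `sign` and `sum_of_squares` are hypothetical, and each interaction is computed only from the three replicates in which it is not confounded, as the text prescribes.

```python
# Plot yields of Table 12-4, keyed by replicate and treatment combination.
yields = {
    "I":   {"(1)": 25.7, "a": 23.2, "b": 21.0, "c": 17.6,
            "ab": 21.1, "ac": 18.6, "bc": 18.3, "abc": 17.5},
    "II":  {"(1)": 27.6, "a": 25.6, "b": 26.2, "c": 27.9,
            "ab": 28.5, "ac": 26.7, "bc": 27.2, "abc": 22.0},
    "III": {"(1)": 21.4, "a": 18.8, "b": 18.8, "c": 16.0,
            "ab": 16.4, "ac": 16.6, "bc": 18.6, "abc": 18.2},
    "IV":  {"(1)": 23.9, "a": 25.4, "b": 26.9, "c": 25.2,
            "ab": 21.4, "ac": 20.6, "bc": 22.4, "abc": 30.1},
}
confounded_in = {"AB": "I", "AC": "II", "BC": "III", "ABC": "IV"}


def sign(effect, treatment):
    """Product over the letters of the effect: present -> +1, absent -> -1."""
    present = set(treatment.replace("(1)", ""))
    s = 1
    for letter in effect.lower():
        s *= 1 if letter in present else -1
    return s


def sum_of_squares(effect):
    """Single-degree-of-freedom sum of squares from the unconfounded replicates."""
    reps = [r for r in yields if confounded_in.get(effect) != r]
    total = sum(sign(effect, t) * y for r in reps for t, y in yields[r].items())
    return total ** 2 / (8 * len(reps))


grand_total = sum(y for rep in yields.values() for y in rep.values())
correction_term = grand_total ** 2 / 32
total_ss = sum(y ** 2 for rep in yields.values() for y in rep.values()) - correction_term

for effect in ["A", "B", "C", "AB", "AC", "BC", "ABC"]:
    print(effect, round(sum_of_squares(effect), 3))
print("Total", round(total_ss, 3))
```

Main effects use all 32 plots (divisor 32), while each interaction uses 24 plots (divisor 24), which is where the precision factor of 3/4 enters.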


If a table of individual treatment means is to be presented, it is necessary
to make corrections for block effects. The procedure shown below
demonstrates how this can be done systematically, the first step being to
calculate the appropriate corrections. We list the block totals in pairs,
showing the effects confounded in a particular pair. The subscript p
indicates the principal block, and q the remaining block. The plus and
minus signs show how the difference between the block totals is to be
taken. These arise from the conventional expressions for main effects
and interactions such as


$AB = \frac{1}{4}(a - 1)(b - 1)(c + 1)$
$ABC = \frac{1}{4}(a - 1)(b - 1)(c - 1)$


Note that the principal block always contains the treatment
combination (1); therefore the sign can be determined by inspection. The
formulas show that for a 2-factor interaction the results for the treatments
in the principal block are added, and for a 3-factor interaction they are
subtracted. The blocks can be indicated as plus or minus, and the
procedure given leads to the correction for the minus block. The
correction for the plus block is the same value of opposite sign.


Comparison confounded             AB               AC               BC               ABC
Block                         I_p    I_q      II_p    II_q     III_p   III_q     IV_p    IV_q
Block total                   81.9   81.1     102.5   109.2    77.0    67.8      88.3    107.6
Sign                           +      -         +       -        +       -         -       +
Difference                       0.80            - 6.70            9.20             19.30
Difference/8                     0.10            - 0.84            1.15              2.41
1/2 mean effect (calculated
  above)                         0.20            - 0.62            0.30            - 0.58
Difference = Correction to
  minus block                  - 0.10            - 0.22            0.85              2.99

Correction to plus block       + 0.10            + 0.22          - 0.85            - 2.99

These corrections are on a single-plot basis; therefore a corrected total
will be given by adding the corrections, one for each plot in the original
total. Thus the corrected totals for some of the treatments are


(1)   98.6 + 0.10 + 0.22 - 0.85 + 2.99 = 101.1
a     93.0 - 0.10 - 0.22 - 0.85 - 2.99 = 88.8
b     92.9 - 0.10 + 0.22 + 0.85 - 2.99 = 90.9
etc. 
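
The correction procedure can be retraced as a short sketch. The mean effects are entered exactly as quoted in the text, and the plus and minus block totals are taken from the table of signs; the names `block_totals`, `mean_effect`, `sign` and `corrected_total` are hypothetical. The sketch assumes that a treatment lies in the plus block of a replicate when its sign under the confounded effect is +1.

```python
# (plus-block total, minus-block total) in the replicate confounding each effect.
block_totals = {"AB": (81.9, 81.1), "AC": (102.5, 109.2),
                "BC": (77.0, 67.8), "ABC": (107.6, 88.3)}
# Mean effects as quoted in the text.
mean_effect = {"AB": 0.400, "AC": -1.242, "BC": 0.600, "ABC": -1.158}
treatment_totals = {"(1)": 98.6, "a": 93.0, "b": 92.9}


def sign(effect, treatment):
    """Product over the letters of the effect: present -> +1, absent -> -1."""
    present = set(treatment.replace("(1)", ""))
    s = 1
    for letter in effect.lower():
        s *= 1 if letter in present else -1
    return s


# Difference/8 minus half the mean effect gives the correction for the minus block.
correction_minus = {
    effect: (plus - minus) / 8 - mean_effect[effect] / 2
    for effect, (plus, minus) in block_totals.items()
}


def corrected_total(treatment):
    """Add one per-plot correction for each of the four replicates."""
    total = treatment_totals[treatment]
    for effect, corr in correction_minus.items():
        in_plus_block = sign(effect, treatment) > 0
        total += -corr if in_plus_block else corr
    return total


for t in treatment_totals:
    print(t, round(corrected_total(t), 2))
```

Dividing a corrected total by 4 then gives the corrected treatment mean.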


In comparing means it is important to keep in mind that the interactions
are determined on only 3/4 of the replications; thus the standard error of
a mean interaction response is


$\sqrt{\frac{2 \times 3.731}{12}} = 0.788$

The error is the mean square from the analysis of variance. It is multiplied
by 2 because such a response is a difference between 2 means. It is
divided by 12 because it represents a difference between means of 12 plots.
Similarly the standard error of a main effect will be


$\sqrt{\frac{2 \times 3.731}{16}} = 0.683$


Since only 1 degree of freedom is involved in each comparison, the
analysis of variance table may be dispensed with and the t test applied
to all effects. There is no reduction in the work, however, through this
procedure.

12. Confounding in $2^n$ experiments. For a $2^5$ experiment with the
factors A, B, C, D, E, it is obvious that we will have 32 treatment
combinations. Confounding of one comparison such as the interaction ABCDE
will enable us to divide the replicates into 2 blocks of 16 plots each, but
these may not be small enough to bring about sufficient error control.
In order to divide the replicate into 4 blocks of 8 plots, it is necessary to
confound 3 degrees of freedom. The selection of the contrasts for
confounding is simplified by the rule given by Yates [11] and described further
by Finney [3]: If two contrasts such as ABC and DE are confounded,
their product as defined by the algebraic rule given in Section 7 is also
confounded. In this example the third contrast would be ABC x DE
= ABCDE. If we selected ABCD and BCDE, their product AE would
also be confounded. The reason for this is seen by examining the $2^3$
experiment again and deciding to confound AB and AC. The principal
blocks for confounding AB and AC separately are


AB | (1) ab c abc |
AC | (1) ac b abc |


The combinations that are common to both make up the principal block
for confounding AB and AC. This is | (1) abc |, and by multiplication
the remaining blocks are generated, giving the four 2-plot blocks


| (1) abc | a bc | b ac | c ab |


Now, on writing out the principal block for confounding BC, which is
| (1) bc a abc |, it is noted that this contains the first 2 of the blocks 
given above, and it is clear that BC is confounded. 
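
The product rule and the small $2^3$ illustration can be checked with a brief sketch. The function names `product` and `parity` are hypothetical, and grouping the combinations by the parity of the letters they share with each confounded contrast is an equivalent formulation assumed for the sketch.

```python
from itertools import combinations


def product(x, y):
    """Product of two contrasts: letters occurring twice are struck out."""
    return "".join(sorted(set(x) ^ set(y)))


def parity(effect, treatment):
    """0 if the treatment shares an even number of letters with the effect, else 1."""
    return len(set(effect.lower()) & set(treatment)) % 2


print(product("ABC", "DE"), product("ABCD", "BCDE"), product("AB", "AC"))

# Group the 8 combinations of a 2^3 experiment by their parities under AB and AC.
blocks = {}
for k in range(4):
    for letters in combinations("abc", k):
        t = "".join(letters)
        blocks.setdefault((parity("AB", t), parity("AC", t)), []).append(t or "(1)")

for key, block in blocks.items():
    # BC has the same parity throughout each block, so it too is confounded.
    print(key, block, {parity("BC", t.replace("(1)", "")) for t in block})
```

Each of the four blocks shows a single parity for BC, which is another way of seeing that confounding AB and AC carries BC with it.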

By means of this rule it is possible to make a wise selection of the 
effects to be confounded and to write out the blocks without difficulty. 
For a $2^5$ experiment the following are some of the sets of contrasts that
can be confounded.


ABCDE    ABCD    E
ABCDE    BCD     AE
ABE      ACD     BCDE


The third set is obviously the best set to confound as it does not involve
the confounding of a main effect or a 2-factor interaction. The procedure
is illustrated below for writing out the blocks for confounding this set.
The simplest method is to write out the principal block for confounding
ABE and ACD separately. We have


ABE    (1)*   ab     ae     be*    c      abc*   ace*   bce
       d      abd*   ade*   bde    cd*    abcd   acde   bcde*

ACD    (1)*   ac     ad     cd*    b      abc*   abd*   bcd
       e      ace*   ade*   cde    be*    abce   abde   bcde*


Those that are common to both are marked with an asterisk. Putting these
down and multiplying successively by a, b, and c, the whole set is generated.


ABE, ACD | (1)  be   cd   abc  ace   abd   ade   bcde  | Block 1
x a      | a    abe  acd  bc   ce    bd    de    abcde | Block 2
x b      | b    e    bcd  ac   abce  ad    abde  cde   | Block 3
x c      | c    bce  d    ab   ae    abcd  acde  bde   | Block 4


As a check we can write the principal block for confounding BCDE.
This is

(1)   bc    bd    be    cd    ce    de    bcde
a     abc   abd   abe   acd   ace   ade   abcde

and it will be found that the treatments are identical with the treatments
of blocks 1 and 2. This proves that all 3 interactions are confounded.
It is of course unnecessary to confound in the same manner in each
replicate. There are 5 sets of 3 interactions of the type confounded here.
These are:


ABC    ADE    BCDE
ABD    BCE    ACDE
ACE    BCD    ABDE
ACD    BDE    ABCE
ABE    CDE    ABCD


The above comprise a balanced set, as in 5 replications each of the
interactions is confounded once only. All interactions are tested on 4/5 of
the replications. 
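
The construction of the four 8-plot blocks for the set ABE, ACD, BCDE can be sketched as follows. The names `parity` and `blocks_for` are hypothetical, and the sketch assumes the even-parity description of a principal block (an even number of letters in common with each confounded contrast) in place of the underlining procedure of the text.

```python
from itertools import combinations

FACTORS = "abcde"


def parity(effect, treatment):
    """0 if the treatment shares an even number of letters with the effect, else 1."""
    return len(set(effect.lower()) & set(treatment)) % 2


def blocks_for(generators):
    """Group all treatment combinations by their parities under the generators."""
    blocks = {}
    for k in range(len(FACTORS) + 1):
        for letters in combinations(FACTORS, k):
            t = "".join(letters)
            key = tuple(parity(g, t) for g in generators)
            blocks.setdefault(key, []).append(t or "(1)")
    return blocks


blocks = blocks_for(["ABE", "ACD"])
principal = blocks[(0, 0)]
print(principal)

# The principal block for BCDE should be the union of two of the four blocks.
bcde_principal = {t for b in blocks.values() for t in b
                  if parity("BCDE", t.replace("(1)", "")) == 0}
print(bcde_principal == set(blocks[(0, 0)]) | set(blocks[(1, 1)]))
```

The block keyed (1, 1) is the one generated in the text by multiplying the principal block by a.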

It is obvious that the confounding procedure can be continued indefi- 
nitely as the number of factors is increased. Each additional factor 
doubles the number of treatments per replicate, and more degrees of 
freedom must be confounded in order to retain error control. For 
example, with 6 factors we have $2^6 = 64$ treatment combinations, and we
have to confound 7 degrees of freedom in order to have 8-plot blocks.
A logical selection of interactions to be confounded would be


ABCD    ABEF    CDEF    ADE    BCE    BDF    ACF


For further information on designs of this type, the reader is referred 
to Yates [11] and Cochran and Cox [2]. 

13. Factorial designs for $2^n$ experiments in one replication. In a
consideration of factorial experiments generally and especially with respect
to $2^n$ experiments it is of interest to write down the disposition of degrees
of freedom among the main factors and interactions as follows.


                                          Factors
                                2      3      4      5      6      7

Main effects                    2      3      4      5      6      7
2-factor interactions           1      3      6     10     15     21
3-factor interactions                  1      4     10     20     35
4-factor interactions                         1      5     15     35
5-factor interactions                                1      6     21
6-factor interactions                                       1      7
7-factor interactions                                              1

Percentage in interactions
having 3 or more factors        0   14.3   33.3   51.6   66.7   78.0


Since we do not as a general rule expect to get important results from
3-factor and higher interactions, in unconfounded experiments with 4 or
more factors, it is remarkable that so much of the experimentation effort
is put into the determination of comparisons that are of no practical 
value. It follows that in such experiments confounding is practically a 
necessity if a reasonably high efficiency is to be expected. 
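
The disposition of degrees of freedom is a matter of binomial coefficients, and a short sketch makes the pattern explicit. The function name `high_order_percentage` is hypothetical; the number of k-factor interactions among n factors is taken as the number of ways of choosing k letters from n.

```python
from math import comb


def high_order_percentage(n, min_order=3):
    """Percentage of the 2^n - 1 degrees of freedom in interactions of min_order or more factors."""
    total = 2 ** n - 1
    high = sum(comb(n, k) for k in range(min_order, n + 1))
    return 100 * high / total


for n in range(2, 8):
    # Degrees of freedom for main effects, 2-factor, 3-factor, ... interactions.
    row = [comb(n, k) for k in range(1, n + 1)]
    print(n, row, round(high_order_percentage(n), 1))
```

The same function with `min_order=2` would give the share taken by all interactions, should that comparison be wanted.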

Another important point arises from a consideration of the analysis of 
variance procedure as a method of estimating variances, as outlined in 
Chapter 11. Each of the interactions provides an estimate of its own 
variance plus an estimate of the error variance, the latter being a measure 
of the extent to which all treatment differences vary from replicate to 
replicate. For any given mean square such as $s_5^2$, the mean square for a
5-factor interaction, we can put $s_5^2 = \sigma_e^2 + \sigma_5^2$ where $\sigma_e^2$ is the error
variance. The error mean square $s_e^2$ furnishes an estimate of $\sigma_e^2$ directly.
Now $s_5^2$ is not altered by the number of replicates except with respect to
the accuracy with which the variances are estimated. In other words its
mean value in a large series of trials will be the same regardless of the
number of replicates in each trial. Consequently, even if we have only
1 replicate, $s_5^2$ gives a valid estimate of $\sigma_e^2 + \sigma_5^2$. Assuming then that
$\sigma_5^2$ is negligible, it follows that $s_5^2$ is a valid estimate of $\sigma_e^2$. The practical
value of such an estimate will depend on the number of degrees of freedom 
available. In a 6-factor experiment, pooling the 3-, 4-, 5-, and 6-factor 
interactions gives a total of 42 degrees of freedom for an estimate of the 
error. Even with 5 factors there are 16 degrees of freedom for the 
estimation of error. The general conclusion is that, with a reasonable 
degree of assurance that the high-order interactions are not important, 
it is feasible to conduct such experiments in one replicate and to use the 
interaction mean squares as error. This is particularly important in 
exploratory experiments where it is important to include as many factors 
as possible. In other words it is more important to include these factors 
than it is to have replication. As a simple example we can compare an 
experiment involving 4 factors in 4 replications, making a total of 64 plots, 
with an experiment involving 6 factors in 1 replicate, again with a total 
of 64 plots. For a preliminary experiment, where very little is known 
about the effect of the factors, the 6-factor experiment in one replicate 
would be much more likely to provide the information required than the 
4-factor experiment in 4 replicates. 

14. Fractional replication. With 7 or more factors even a single 
replicate requires more plots than are usually available for a field plot 
experiment. Finney [3] has examined the problem of the replication of 
factors without replication of all combinations of their levels, so as to 
permit the drawing of conclusions from experiments having less than a 
single replicate. 

The basic principle of fractional replication can be illustrated with
reference to a simple 2-factor experiment. If the factors are A, B, the
treatment combinations are (1), a, b, ab. If we test only 2 of the
treatments, (1) and ab, what kind of conclusions can be drawn? The only
measure of the A or the B responses is ab - (1); therefore the 2 responses
cannot be distinguished and this is expressed in symbols by


A = B


or, in other words, B is an alias of A. On this basis and noting that the
2 treatment combinations tested are from the principal block for
confounding AB, the defining contrast, Finney develops the following rule.


If instead of a complete replicate of an experiment, only the treat- 
ment combinations forming the principal block of a certain confounding 
arrangement are used, there is no measure of the contrasts which would 
be confounded in that arrangement; these will be termed the defining 
contrasts. The product of any other contrast with a defining contrast 
is an alias of the former. 


Suppose that we have a $2^4$ experiment with factors A, B, C, D, and we
take the principal block arising from confounding ABC. The block is


| (1)  bc  d  bcd  ab  ac  abd  acd |


The only measure of the response to A comes from a comparison of the
first 4 treatments with the remainder. On writing out the treatments to
be compared for an estimate of BC, however, we have


(1)*   bc*   d*     bcd*    a    abc   ad   abcd
ab*    ac*   abd*   acd*    b    c     bd   cd


and those marked with an asterisk are the only treatment combinations
available for an estimate of A. In other words BC is an alias of A. This is
given by the rule quoted above; for


ABC x A = BC

ABC is the defining contrast, and, multiplying this by A, the product
obtained is an alias of A. Similarly


ABC x ABCD = D 


and D is an alias of ABCD. 
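
The alias rule lends itself to a small sketch. The function names `alias` and `sign` are hypothetical, and the sign convention (a factor of +1 for each letter of the effect present in the combination, -1 for each absent) is assumed. The check at the end shows that, within the half replicate, the contrasts for A and BC agree up to a constant sign, which is what makes them indistinguishable.

```python
def alias(effect, defining_contrast):
    """Product of an effect with the defining contrast; repeated letters are struck out."""
    return "".join(sorted(set(effect) ^ set(defining_contrast)))


def sign(effect, treatment):
    """Product over the letters of the effect: present -> +1, absent -> -1."""
    present = set(treatment.replace("(1)", ""))
    s = 1
    for letter in effect.lower():
        s *= 1 if letter in present else -1
    return s


print(alias("A", "ABC"))      # alias of A
print(alias("ABCD", "ABC"))   # alias of ABCD

# Principal block for confounding ABC in the 2^4 experiment of the text.
half_replicate = ["(1)", "bc", "d", "bcd", "ab", "ac", "abd", "acd"]
ratios = {sign("A", t) * sign("BC", t) for t in half_replicate}
print(ratios)  # a single value: the two contrasts differ only by a constant sign
```

Applying `alias` twice with the same defining contrast returns the original effect, so aliases always occur in pairs.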

The conclusion from these considerations is that all effects will have a
corresponding alias, and if an arrangement can be worked out such that
the aliases of the important effects can be considered negligible the whole
of the observed effect can be assumed to be due to the former. Thus if
ABCD is an alias of D it would be logical to assume that the result for
this particular contrast is due to D. This can be done with a reasonable
degree of assurance when there are several factors in the experiment.

A $2^6$ experiment can be designed in a 1/2 replicate taking ABCDEF as
the defining contrast. Then we will have such alias sets as


A = BCDEF
B = ACDEF
AB = CDEF
AC = BDEF
ABC = DEF
etc.


There will obviously be no difficulty in such an experiment in assessing
the main effects and 2-factor interactions. Three-factor interactions will
be indistinguishable, but in any event they will be required for an estimate
of error. The outline of the analysis will be


                                                DF

Main effects (type A = BCDEF)                    6
2-factor interactions (type AB = CDEF)          15
Error                                           10
Total                                           31


If we wish to reduce the size of the blocks, it is necessary to introduce
confounding. Blocks of 16 can be used if it is satisfactory to confound
two 3-factor interactions such as ABC = DEF. Blocks of 8 require the
confounding of one 2-factor interaction. For example, we can confound
CD, ABC, and ABD as well as their corresponding aliases ABEF, DEF,
and CEF. The constitution of the blocks for such a confounding
arrangement is determined by another rule given by Finney [3], which
is as follows.


The principal block for a fractional replicate design is the same as
the principal block of the complete replicate for a design which
confounds the same contrasts (including all aliases) together with the
defining contrast.

Suppose that we wish to write the principal block for a $2^6$ experiment
confounding ABC, ABD, CD, and their aliases, as well as the defining
contrast ABCDEF. We put these down as follows,


Primary Contrasts        Aliases               Defining Contrast
ABC   ABD   CD           DEF   CEF   ABEF      ABCDEF


and proceed to set up the principal block. Since it would be tedious to
write out the principal blocks for all contrasts and find those that are
common, we can adopt a shorter method which is assisted by another rule
given by Finney [3]. This rule states that “the product of any 2 treatment
combinations occurring in a principal block also belong to that block.”
Thus if we find 2 of the combinations, their product will give a third and
the next one that we find will generate others. Writing down some of the
combinations arising from the principal block of ABC, we find that, for
example, ab and ef are common to all interactions. Their product is
abef. The next combination found to be common is acde. Multiplying
this by those found previously, the complete principal block is obtained,
which is

| (1)  ab  ef  abef  acde  bcde  acdf  bcdf |


The remaining blocks are obtained by multiplication with the one
restriction that, since all the combinations produced must belong to the
principal block for confounding ABCDEF, they must contain an even number
of letters. Choosing ac, ad, ae, we have, finally


ac   bc   acef   bcef   de   abde   df     abdf
ad   bd   adef   bdef   ce   abce   cf     abcf
ae   be   af     bf     cd   abcd   cdef   abcdef


The analysis of variance will be of the form: 


DF 

Blocks 3 
Main effects 6 
2-factor interactions 14 
Error 8 
Total 31 
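
The four blocks of the half replicate can also be generated mechanically. In the sketch below the names `parity` and `multiply` are hypothetical, and the principal block is found by the even-parity description (an even number of letters in common with every confounded contrast and with the defining contrast), which is assumed here as an equivalent of Finney's rule.

```python
from itertools import combinations

FACTORS = "abcdef"
CONTRASTS = ["ABC", "ABD", "CD", "DEF", "CEF", "ABEF", "ABCDEF"]


def parity(effect, treatment):
    """0 if the treatment shares an even number of letters with the effect, else 1."""
    return len(set(effect.lower()) & set(treatment)) % 2


def multiply(x, y):
    """Product of two treatment combinations; repeated letters are struck out."""
    return "".join(sorted(set(x) ^ set(y)))


all_treatments = ["".join(c) for k in range(7) for c in combinations(FACTORS, k)]

# The empty string stands for the combination (1).
principal = [t for t in all_treatments if all(parity(c, t) == 0 for c in CONTRASTS)]
other_blocks = [[multiply(t, m) for t in principal] for m in ("ac", "ad", "ae")]

print([t or "(1)" for t in principal])
for block in other_blocks:
    print(block)
```

All 32 combinations produced contain an even number of letters, as the restriction in the text requires.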


The method of analysis of fractional replicate experiments is a
straightforward application of the methods outlined for the analysis of $2^3$
experiments and should present no difficulty.

15. A $3^2$ factorial experiment. The 2 factors A, B in an experiment of
this type are each at 3 levels. The 9 treatment combinations can therefore
be designated as follows:


                B
          00    01    02
    A     10    11    12
          20    21    22


where 01, for example, represents $a_0 b_1$ in the conventional notation.
Note that the level of A is always written first.

Assuming that there is no confounding, the outline of the analysis of a
$3^2$ experiment in 4 replicates would be


DF 

Replicates 3 
A 2 
B 2 
AB 4 
Error 24 
Total 35 


An interesting and important fact in connection with $3^2$ experiments
follows from the properties of a 3 x 3 Graeco-Latin square as described
in Chapter 11, Section 9. We noted there that the 8 degrees of freedom
could be divided into 4 pairs that are mutually orthogonal. In the 3 x 3
experiment this would involve the partitioning of the 4 degrees of freedom
for AB into 2 orthogonal pairs. This is accomplished as follows by a
procedure which is the same as assigning Latin letters and subscripts in a
Graeco-Latin square. Let the treatment symbols represent yields in a
3 x 3 square.


                B
          00    01    02
    A     10    11    12
          20    21    22


The row totals can be represented by $A_0, A_1, A_2$ and the column totals
by $B_0, B_1, B_2$. Then let the diagonal totals be


$I_0 = 00 + 11 + 22$        $J_0 = 00 + 21 + 12$
$I_1 = 10 + 21 + 02$        $J_1 = 10 + 01 + 22$
$I_2 = 20 + 01 + 12$        $J_2 = 20 + 11 + 02$

Since in each of these totals one yield is taken from each row and each
column, it is obvious that they are orthogonal with rows and columns.
They are also orthogonal with each other. This is seen by noting that
any total, such as $I_0$, contains one yield from each of the J totals.

The I and J totals obviously represent the 4 degrees of freedom for
AB, but the separation into I and J components is purely formal.
Regardless of any effects in the experiment, the sums of squares for I and J
would tend to be of the same magnitude. These facts can be utilized to
make a direct calculation of the interaction sum of squares. For example,
if there are r replications and G is the grand total, we have


$I = \frac{I_0^2 + I_1^2 + I_2^2}{3r} - \frac{G^2}{9r}$

with the corresponding expression for J,


and AB = I + J. If the interaction sum of squares has been calculated 
by subtraction, the above method furnishes a complete check on all the 
work. 
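
The check described here can be tried on a small set of figures. The sketch below uses nine illustrative yields for a single replicate (r = 1), assumed purely for the purpose; the helper names `totals_by` and `ss` are hypothetical. The I totals are indexed by the difference of the two levels modulo 3 and the J totals by their sum modulo 3, which reproduces the groupings listed above.

```python
# Illustrative yields for one replicate, keyed by (level of A, level of B).
yields = {(0, 0): 18, (1, 1): 16, (2, 2): 14,
          (1, 0): 16, (2, 1): 14, (0, 2): 13,
          (2, 0): 12, (0, 1): 18, (1, 2): 17}
r = 1
G = sum(yields.values())
correction_term = G ** 2 / (9 * r)


def totals_by(index):
    """Three totals, grouping the yields by index(i, j) in {0, 1, 2}."""
    out = [0, 0, 0]
    for (i, j), y in yields.items():
        out[index(i, j)] += y
    return out


def ss(totals):
    """Sum of squares for a set of 3 totals, each a sum of 3r plots."""
    return sum(t ** 2 for t in totals) / (3 * r) - correction_term


ss_A = ss(totals_by(lambda i, j: i))
ss_B = ss(totals_by(lambda i, j: j))
ss_I = ss(totals_by(lambda i, j: (i - j) % 3))
ss_J = ss(totals_by(lambda i, j: (i + j) % 3))

# With a single replicate there is no error term, so AB is total less A less B.
ss_total = sum(y ** 2 for y in yields.values()) - correction_term
ss_AB_by_subtraction = ss_total - ss_A - ss_B

print(round(ss_I, 3), round(ss_J, 3), round(ss_AB_by_subtraction, 3))
```

Agreement between the sum of the I and J components and the interaction found by subtraction is the complete check referred to in the text.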

Although it is unlikely that we would want to do any confounding in
an experiment with only 9 treatments, it is of interest to inquire how this
would be done with a view to extending the principle to $3^3$ experiments.
The only possible division of the replicate is into 3 blocks of 3 plots each;
therefore only 2 degrees of freedom can be confounded. The logical
procedure is to select I or J. If we select I, the treatments placed in the
blocks will correspond to those given above for the I totals. Assuming
I to be completely confounded in 4 replicates, the form of the analysis
will be


DF 

Blocks 11 
A 2 
B 2 
AB (estimated from J totals) 2 
Error 18 
Total 35 


It is important to observe here that, although a portion of the 2-factor
interaction is confounded, we still have a portion remaining which will
enable us to obtain a test of significance of the interaction effect.
The design outlined is such as to make it impossible to present a 3 x 3 
table of means because of the partial confounding of these totals with 
blocks. It is therefore more satisfactory to confound I in one replicate
and J in the remaining replicates. It is then possible to make appropriate
corrections to the means for single-treatment combinations. To see how
these corrections are made we can assume a set of yields for a $3^2$ experiment
in 2 replicates. I is confounded in the first replicate and J in the
second.


        Replicate 1 (I confounded)                 Replicate 2 (J confounded)

Block                            Block      Block                            Block
                                 mean                                        mean
I_0   Treatment   00   11   22              J_0   Treatment   00   21   12
      Yield       18   16   14   16.00            Yield       19   16   15   16.67
I_1   Treatment   10   21   02              J_1   Treatment   10   01   22
      Yield       16   14   13   14.33            Yield       18   13   18   16.33
I_2   Treatment   20   01   12              J_2   Treatment   20   11   02
      Yield       12   18   17   15.67            Yield       13   17   19   16.33

Replicate mean                   15.33                                       16.44


The treatments 00, 11, and 22 in block $I_0$ occur in different blocks in
the second replicate. Thus the mean of $I_0$ contains treatment plus block
effect, but the mean of the 3 treatments 00, 11, 22 in the second replicate
is free from block effect. The difference of these means less the portion
due to the replicate means is the required measure of block effect per
plot. We have, therefore, for block $I_0$

$$
\frac{19 + 17 + 18}{3} - 16.00 - (16.44 - 15.33) = 0.89
$$

Similarly for block $J_0$

$$
\frac{18 + 14 + 17}{3} - 16.67 - (15.33 - 16.44) = 0.77
$$

The full set of corrections is finally

I_0    0.89        J_0    0.77
I_1    2.23        J_1    0.78
I_2   -3.11        J_2   -1.55


Since these are corrections per plot, they can be added in with the treat- 
ment totals. Thus for treatment 00 the corrected total is 


18.00 + 0.89 + 19.00 + 0.77 = 38.7 
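The block corrections can be reproduced with a short Python sketch. Two points are assumptions rather than statements from the text: the yields 19, 16, 15 used for block $J_0$ are reconstructed from the printed block mean (16.67) and the printed corrections, and the function names are invented for this illustration. Because the code works with unrounded means, its results can differ from the printed figures by about 0.01.

```python
# Yields keyed by block and treatment code; replicate 1 confounds I, replicate 2 confounds J.
rep1 = {"I0": {"00": 18, "11": 16, "22": 14},
        "I1": {"10": 16, "21": 14, "02": 13},
        "I2": {"20": 12, "01": 18, "12": 17}}
rep2 = {"J0": {"00": 19, "21": 16, "12": 15},   # reconstructed values (assumption)
        "J1": {"10": 18, "01": 13, "22": 18},
        "J2": {"20": 13, "11": 17, "02": 19}}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def flatten(replicate):
    """Map treatment code -> yield for a whole replicate."""
    return {t: y for block in replicate.values() for t, y in block.items()}


def block_corrections(confounded, other):
    """Per-plot correction for each block of `confounded`.

    The same treatments are spread over different blocks in `other`, so their
    mean there is free of block effect; the replicate difference is removed.
    """
    other_yields = flatten(other)
    replicate_shift = mean(other_yields.values()) - mean(flatten(confounded).values())
    return {
        name: mean(other_yields[t] for t in block) - mean(block.values()) - replicate_shift
        for name, block in confounded.items()
    }


corrections = {**block_corrections(rep1, rep2), **block_corrections(rep2, rep1)}

# Corrected total for treatment 00: each yield plus the correction of its block.
corrected_total_00 = (rep1["I0"]["00"] + corrections["I0"]
                      + rep2["J0"]["00"] + corrections["J0"])

for name in sorted(corrections):
    print(name, round(corrections[name], 2))
print("corrected total for 00:", round(corrected_total_00, 1))
```

Within each replicate the three corrections sum to zero, which is a convenient arithmetic check on the hand calculation.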




16. A $3^3$ factorial experiment. Since there are 27 treatment combinations
in this type of experiment, there should be considerable gain in
precision due to confounding. If the factors are A, B, C, the treatment 
degrees of freedom can be sorted out as follows. 


DF 

Main effects 6 
2-factor interactions 12 
3-factor interactions 8 
Total 26 


We look naturally for a method of confounding a portion of the degrees 
of freedom of the 3-factor interaction. Since the replicates contain 27 
plots, a logical procedure is to attempt to confound 2 degrees of freedom 
of this interaction, enabling us to divide each replicate into 3 blocks of 
9 plots each. The clue to the confounding method is given by the 
principle applied in Section 15 for dividing a 2-factor interaction for 4 
degrees of freedom into 2 orthogonal pairs. A continuation of the 
method enables us to divide the 8 degrees of freedom for a 3-factor 
interaction into 4 orthogonal pairs. The first step is to write out the 
treatment combinations. 


000  010  020      001  011  021      002  012  022
100  110  120      101  111  121      102  112  122
200  210  220      201  211  221      202  212  222


The I diagonals of each square will give a new 3 x 3 square of I totals
as follows.

$$
\begin{aligned}
X_1 &= (000 + 110 + 220) & Y_1 &= (001 + 111 + 221) & Z_1 &= (002 + 112 + 222) \\
X_2 &= (100 + 210 + 020) & Y_2 &= (101 + 211 + 021) & Z_2 &= (102 + 212 + 022) \\
X_3 &= (200 + 010 + 120) & Y_3 &= (201 + 011 + 121) & Z_3 &= (202 + 012 + 122)
\end{aligned}
$$

Taking both I and J diagonals from this square, we have:

$$
\begin{array}{cc}
II & IJ \\
X_1 + Y_2 + Z_3 & X_1 + Y_3 + Z_2 \\
X_2 + Y_3 + Z_1 & X_2 + Y_1 + Z_3 \\
X_3 + Y_1 + Z_2 & X_3 + Y_2 + Z_1
\end{array}
$$

The whole procedure can be repeated starting with J diagonals. Thus,

$$
\begin{aligned}
X'_1 &= (000 + 210 + 120) & Y'_1 &= (001 + 211 + 121) & Z'_1 &= (002 + 212 + 122) \\
X'_2 &= (100 + 010 + 220) & Y'_2 &= (101 + 011 + 221) & Z'_2 &= (102 + 012 + 222) \\
X'_3 &= (200 + 110 + 020) & Y'_3 &= (201 + 111 + 021) & Z'_3 &= (202 + 112 + 022)
\end{aligned}
$$

These yield

$$
\begin{array}{cc}
JI & JJ \\
X'_1 + Y'_2 + Z'_3 & X'_1 + Y'_3 + Z'_2 \\
X'_2 + Y'_3 + Z'_1 & X'_2 + Y'_1 + Z'_3 \\
X'_3 + Y'_1 + Z'_2 & X'_3 + Y'_2 + Z'_1
\end{array}
$$


Each set of 3 totals represents 2 degrees of freedom of the 3-factor inter- 
action ABC. It follows that in order to confound 2 degrees of freedom 
among 3 blocks of a replicate it is simply a matter of placing the combina- 
tions occurring in one of the above sets of 3 totals in different blocks. 
For example, to confound IJ the blocks would be as follows.


Block 1   000 110 220   201 011 121   102 212 022
Block 2   100 210 020   001 111 221   202 012 122
Block 3   200 010 120   101 211 021   002 112 222


If this plan is followed in all replications, there is complete confounding 
of the 2 degrees of freedom for IJ and the outline of the analysis of variance
for 4 replicates would be 


DF 

Main effects 6 
2-factor interactions 12 
3-factor interactions 6 
Blocks 11 
Error 72 
Total 107 


The 3-factor interaction sum of squares comes from pooling the sums of 
squares for each pair of degrees of freedom. Having confounded IJ in 
all replicates, the effects formally represented by II, JI, and JJ would be
unconfounded and the totals would be taken over all replicates. The
interaction mean square for ABC represented by 6 degrees of freedom 
provides a perfectly satisfactory test of the interaction effect, and it would 
seem at first that from the standpoint of simplicity it would be best to 
confound the same pair of degrees of freedom in each replicate. How- 
ever, if we do this, the IJ effect is completely sacrificed and it is then 
impossible to isolate certain components of the interaction for special 
study. For example, we might wish to study the 3-factor interaction
resulting from the linear responses of all factors, which in the notation of
Fisher [6] would be $A_1B_1C_1$, and with complete confounding this effect
could not be sorted out.

The method of partial confounding in a $3^3$ experiment would obviously
be to confound a different pair of degrees of freedom of the 3-factor 
interaction in each replicate. Four replicates provide what is known as 
a balanced set as each pair is confounded once. For the reasons given 
in the preceding paragraph partial is preferred to complete confounding 
even if it is not possible to have a balanced set. A complete set of blocks 
which can be used for an actual partially confounded experiment is given 
in Table 12-6. 


TABLE 12-6 


BALANCED SET OF BLOCKS FOR A PARTIALLY CONFOUNDED $3^3$ EXPERIMENT


      Block
        1     000 110 220   101 211 021   202 012 122
II      2     100 210 020   201 011 121   002 112 222
        3     200 010 120   001 111 221   102 212 022

        1     000 110 220   201 011 121   102 212 022
IJ      2     100 210 020   001 111 221   202 012 122
        3     200 010 120   101 211 021   002 112 222

        1     000 210 120   101 011 221   202 112 022
JI      2     100 010 220   201 111 021   002 212 122
        3     200 110 020   001 211 121   102 012 222

        1     000 210 120   201 111 021   102 012 222
JJ      2     100 010 220   001 211 121   202 112 022
        3     200 110 020   101 011 221   002 212 122
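The diagonal construction behind Table 12-6 can be checked mechanically. The sketch below uses a shortcut that is not stated in the text and should be read as an assumption of this illustration: a combination abc is placed in block k + 1 of a set when a linear form in a, b, c equals k modulo 3, with the signs of b and c chosen per set. With the sign choices shown, the generated blocks agree with the table; the function name `confounded_blocks` is invented here.

```python
from itertools import product


def confounded_blocks(sign_b, sign_c):
    """Split the 27 combinations into 3 blocks by (a + sign_b*b + sign_c*c) mod 3."""
    blocks = {0: [], 1: [], 2: []}
    for a, b, c in product(range(3), repeat=3):
        blocks[(a + sign_b * b + sign_c * c) % 3].append(f"{a}{b}{c}")
    return [blocks[k] for k in range(3)]


# Sign pairs (for b, for c) assumed to correspond to the four sets of the table.
SETS = {"II": (-1, -1), "IJ": (-1, 1), "JI": (1, -1), "JJ": (1, 1)}
design = {name: confounded_blocks(*signs) for name, signs in SETS.items()}

for name, blocks in design.items():
    for number, block in enumerate(blocks, start=1):
        print(name, number, " ".join(block))
```

The combinations within a block are printed in counting order rather than in the order of the table, so the comparison should be made on block membership.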


17. Exercises. 


1. Table 12-7 gives data taken from a uniformity experiment with mangels,
conducted by Summerby [9]. The yields are in pounds of dry matter per plot. A partially
confounded $2^3$ has been assumed on 32 plots as shown in the table.

Make an analysis of the results and obtain treatment means corrected for block effect.


Error (SS) = 56.867. AB = -1.233. ac (corrected) = 30.4.
2. Design complete layouts for the following experiments:
a. To compare treated and untreated seed of 12 varieties of wheat. The variety
comparison is not important.
b. To compare 2 fertilizers each at 4 levels.
c. To test 3 fertilizers at 3 levels, each with 3 cultural treatments.
d. A preliminary experiment involving 6 factors, each at 2 levels.


3. Assume that the experiment designed in 2c above is superimposed on a section
of the yields of Table 2-3; then make an analysis of the results.
TABLE 12-7


YIELDS IN A $2^3$ EXPERIMENT WITH PARTIAL CONFOUNDING


                  I_a           I_b           II_a          II_b

               ab   24.0     b    26.2     (1)  27.5     ab   27.0
               abc  29.6     a    28.2     abc  25.9     c    30.8
               (1)  30.0     bc   29.1     ac   27.4     bc   32.8
               c    30.5     ac   25.2     b    29.6     a    33.2

Block total        114.1         108.7         110.4         123.8


                  III_a         III_b         IV_a          IV_b

               abc  30.2     b    31.0     (1)  31.7     a    33.5
               a    32.4     ab   34.4     bc   29.8     abc  33.2
               bc   31.9     c    35.2     ac   32.5     b    35.6
               (1)  31.6     ac   34.2     ab   29.2     c    32.4

Block total        126.1         134.8         123.2         134.7
G = 975.8
CHAPTER 13


Incomplete Block Experiments 


1. Meaning of incomplete blocks. An incomplete block is one that 
does not contain all the treatments that go to make up a complete replica- 
tion. It is not a new concept at this stage of our study of experimental 
design because we have met with it twice previously, once in split-plot
experiments and again in factorial experiments in which interaction effects 
were confounded with blocks. In order not to have any confusion of 
ideas, the incomplete block experiments now to be described must be 
thought of as a group in which there is no basis for discrimination between 
comparisons with respect to accuracy. In split-plot experiments we are 
satisfied to sacrifice the accuracy of the comparisons for one main effect 
in order to increase accuracy for another. In a factorial experiment with 
confounding we are satisfied to lose accuracy on a high-order interaction 
or even to completely confound it in order to gain accuracy for comparisons 
involving the main effects and interactions of a lower order. A typical 
incomplete block experiment would be one designed to test a number 
of varieties. It is obvious in such an experiment that ordinarily there 
is no basis for deciding that certain comparisons are more valuable 
than others. 

With the above general concept in mind it is not difficult to see 
the intimate relation between all incomplete block experiments in a 
broad sense and those to which the term is specifically applied wherein 
all comparisons are to be made with at least approximately equal 
accuracy. 

2. An elementary type of incomplete block experiment. When the
number of treatments to be tested is quite large, we have already noted 
(Chapter 12, Section 6) that randomized blocks do not exercise sufficient 
control over the error. In an attempt to overcome this, experiments were 
devised in which the varieties were divided into groups. The groups 
remained the same in each replicate so that the effect is similar to that 
obtained in a split-plot experiment. The analysis would follow the same 
general plan, there being one error for intra-block comparisons and 
another for inter-block comparisons. Suppose that we have 20 varieties 
in 4 groups of 5, and there are 6 replications. The analysis would be as
follows.


                  DF    MS
Replicates         5
Groups             3
Error (a)         15    $E_a$
Within groups     16
Error (b)         80    $E_b$
Total            119


For the comparison of varieties within a group the variance of a mean
difference would be

$$
\frac{2E_b}{6} \tag{1}
$$

and for the comparison of means of groups we would have the variance

$$
\frac{2E_a}{30} \tag{2}
$$

The difficulty with such an experiment arises from the need for comparing
varieties in different groups. The error variance for the difference
between 2 means is a weighted mean of $E_a$ and $E_b$,

$$
\frac{4E_b + E_a}{15} \tag{3}
$$


To see what this really means we should examine the situation over the
entire range of the ratio $E_a/E_b$ that it would be reasonable to expect in
many experiments. This would of course be quite difficult, but individual
cases can be studied. Experience shows that the ratio $E_a/E_b = 3$ is a
definite possibility. Taking this value and comparing formulas (1) and
(3) on this basis, we have


Within groups     $\tfrac{1}{3}E_b = \tfrac{5}{15}E_b$
Between groups    $\tfrac{1}{15}(4E_b + 3E_b) = \tfrac{7}{15}E_b$


The ratio of the second of these variances to the first = 7/5 = 1.4, which 
is a measure of the relative efficiency of the two kinds of comparisons. 
This result can be interpreted by saying that 40% more replication would
be required in order to increase the accuracy of the comparison of 
varieties in different groups to the point of accuracy obtained in 6 replicates 
for comparing varieties within groups. This is too great a discrepancy 
on the assumption that our actual requirement is to make all comparisons 
with approximately equal accuracy. 
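The comparison of the two kinds of variance can be repeated for other ratios with a small Python sketch. The general forms used here, $2E_b/r$ for two varieties in the same group and $2[(k-1)E_b + E_a]/(rk)$ for two varieties in different groups of size k, are the usual split-plot expressions and are assumed to be what formulas (1) and (3) represent for r = 6 and k = 5; the function names are illustrative.

```python
def var_within_group(E_b, r):
    """Variance of the difference between two variety means in the same group."""
    return 2 * E_b / r


def var_between_groups(E_a, E_b, r, k):
    """Variance of the difference between two variety means in different groups.

    k is the number of varieties per group; the result is a weighted mean of
    E_a and E_b with weights 1 and (k - 1).
    """
    return 2 * ((k - 1) * E_b + E_a) / (r * k)


r, k = 6, 5          # 6 replicates, groups of 5 varieties
E_b = 1.0            # arbitrary scale
E_a = 3.0 * E_b      # the ratio E_a/E_b = 3 taken in the text

ratio = var_between_groups(E_a, E_b, r, k) / var_within_group(E_b, r)
print(round(ratio, 2))
```

Changing `E_a` shows how quickly the discrepancy grows as the groups become more variable relative to the plots within them.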
3. Lattice experiments. In discussing the type of experiment described
in Section 2, Cochran [2] states:


. . . the design . . . is at fault in keeping the same groups of varieties
together in all replications, a pair of varieties either appearing always 
in the same group or never in the same group. It is clearly better 
to make the opposite rule, that a pair of varieties which are in the 
same group in the first replication shall not appear in the same group 
in a subsequent replication. This rule leads to the construction of 
lattice designs. 


Lattice designs are incomplete block designs for which the number of 
varieties or treatments can be set up in the form of a square lattice* with 


Figure 13-1. Diagram of a 3 x 3 lattice.


variety numbers at the intersection of the lines. A 3 x 3 lattice is shown
in Figure 13-1 where the varieties are represented by 2-digit numbers, 
the first representing the rows and the second the columns. 

Lattice designs were first described by Yates [17] and were known in 
the first place as pseudo-factorial designs, later as quasi-factorial designs, 
and finally as lattice designs. There has been considerable development 
since the first description in 1936, and much the greater part of this has 
also been due to Yates. 


* A design made up from a rectangular lattice is possible, but the analysis is somewhat
more complex. See Yates [17], Harshbarger [12, 13, 14], and Cochran and
Cox [4]. 




The procedure of allocating the varieties to groups so that a pair of
varieties appearing together in the same block shall not appear together 
in the same block in subsequent replicates may be illustrated for the 
simplest case with the 9 varieties shown in a 3 x 3 lattice in Figure 13-1. 
In one replicate the blocks may consist of the varieties in the rows of the 
lattice, giving the replicate


| 11 12 13 | 21 22 23 | 31 32 33 |


Then in the second replicate the blocks can be made up from the columns,
giving

| 11 21 31 | 12 22 32 | 13 23 33 |
This gives us 2 replicates, and if we wish these can be repeated in an 
experiment containing 4, 6, 8, 10, or more replicates. There are other 
possibilities, however, and these should be studied before going into 
further details with respect to methods of analysis. 

4. The simple lattice. The simple lattice is frequently known as a 
square lattice. There are only 2 types of blocks, those made up from the 
rows of the square and those from the columns. We shall refer to these
as group A and group B. It requires a minimum of 2 replicates, and if
more replicates are required the groups are repeated as many times as 
desired. 

5. The triple lattice. A third group in addition to those required for 
a simple lattice, is added by making up blocks from the diagonals of the 
square. Referring to Figure 13-1 of a 3 x 3 lattice, the third group, 
group C, would be 


| 11 22 33 | 21 32 13 | 31 12 23 |


The minimum number of replications is 3, but repetition of these gives 
6, 9, or any multiple of 3. 

6. The quadruple and other partially balanced lattice designs. Suppose 
that we have 25 varieties with the corresponding numbers arranged in a 
5 x 5 lattice. Table 13-1 (a) shows the original square from which 
groups A and B from the rows and columns can be derived. If we write 
out a new square putting those numbers in the rows that occur in the 
diagonals of the original, we have (b), which gives us group С. Con- 
tinuing by writing another square from the diagonals of the square for 
group C, we have (c), which provides group D. The 4 groups would 
give us a quadruple lattice just as the 3 groups give a triple lattice. 

It will be obvious with a little study that the groups formed by these
methods are orthogonal to each other. For example, on examining any 
block such as 31, 52, 23, 44, 15 in group D, we note that in group C each 
of the varieties comes from a different block. Similarly they come from
different rows and different columns of the original square. The question 
then arises as to how many orthogonal squares can be formed. This 
problem has resulted in an interesting series of studies by well-known 
statisticians. For all squares where p is a prime number or a power of 
a prime number it is possible to write out p + 1 orthogonal groups. If 
p is a prime number, the groups can be written consecutively following the
simple mechanical method outlined above for the 5 x 5 square. For 
the squares where p = 4, 8, 9 the orthogonal groups can also be written 
by rule,* but completely orthogonalized squares are given by Fisher and 
Yates [10]. The 6 x 6 square cannot be carried beyond the third group 
which is of course possible for all squares. The 10 x 10 has not yet been 
carried beyond 3 groups, and the 12 x 12 beyond 4 groups. These are 
given by Cochran and Cox [4]. When the completely orthogonalized 
square can be written, experiments in which all groups are used, one in 
each replicate, possess special properties and are referred to as balanced 


lattice designs. 


TABLE 13-1

     Original             Group C              Group D
11 12 13 14 15      11 22 33 44 55      11 32 53 24 45
21 22 23 24 25      21 32 43 54 15      21 42 13 34 55
31 32 33 34 35      31 42 53 14 25      31 52 23 44 15
41 42 43 44 45      41 52 13 24 35      41 12 33 54 25
51 52 53 54 55      51 12 23 34 45      51 22 43 14 35

       (a)                  (b)                  (c)
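The mechanical rule of writing the diagonals of one square as the rows of the next can be sketched in Python. The function names are illustrative and not from the text; the orthogonality check simply counts how often each pair of varieties shares a block across groups A to D, which for a prime p such as 5 should never exceed once.

```python
from itertools import combinations


def original_square(p):
    """p x p square of 2-digit variety numbers (row digit, column digit); p < 10."""
    return [[f"{r + 1}{c + 1}" for c in range(p)] for r in range(p)]


def diagonal_square(square):
    """New square whose rows are the (wrapped) diagonals of `square`."""
    p = len(square)
    return [[square[(r + c) % p][c] for c in range(p)] for r in range(p)]


p = 5
square_a = original_square(p)
square_c = diagonal_square(square_a)   # rows form group C
square_d = diagonal_square(square_c)   # rows form group D

groups = {
    "A": [list(row) for row in square_a],        # rows of the original
    "B": [list(col) for col in zip(*square_a)],  # columns of the original
    "C": square_c,
    "D": square_d,
}


def max_times_together(groups):
    """Largest number of blocks shared by any pair of varieties."""
    counts = {}
    for blocks in groups.values():
        for block in blocks:
            for pair in combinations(sorted(block), 2):
                counts[pair] = counts.get(pair, 0) + 1
    return max(counts.values())


for row in square_d:
    print(" ".join(row))
print("max times any pair shares a block:", max_times_together(groups))
```

Applying `diagonal_square` twice more gives the remaining groups of the balanced set for the 5 x 5 square; the same rule is not claimed here to work when p is not prime.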


7. The balanced lattice. The method of laying out a balanced lattice 
design will be clear from the discussion in Section 6. It will also be 
obvious that except for the 6 x 6 and 10 x 10 squares a partially balanced 
lattice can be designed with any number of replicates between 2 and 11, 
inclusive, but for complete balance the minimum number of replicates is
$p + 1$. Thus for 64 varieties we require 9 replications. The advantage
of the balanced lattice is that all comparisons are made with equal 
accuracy and the method of analysis is somewhat simpler than for partially 
balanced designs. Its chief disadvantage is the number of replicates 
required. It is ideal for 25 varieties since ordinarily in variety tests we
require about 6 replications. It is also satisfactory for 49 varieties if the 
conditions are sufficiently variable to warrant 8 replications. In partially 


* See H. B. Mann, Analysis and Design of Experiments, Dover Publishing Co., 
New York, 1949. 
balanced designs certain pairs of varieties do not occur together in the
same block and are therefore compared with somewhat less accuracy 
than those that do occur in the same block. This difference is greatest 
in the simple lattice and decreases with the number of groups used. 
Even for the simple lattice, however, the difference in the accuracy 
of the comparison of varieties occurring in the same and in different 
blocks is not great enough to be of much concern. The appropriate
standard errors for the two kinds of comparisons can be worked out, 
but in general it is satisfactory to apply a mean standard error to all 
comparisons. 

8. Theory of lattice designs. The principles upon which lattice designs 
are based are most easily explained with the simple or square lattice as 
an example. Suppose that we have a 9-variety experiment with the 
varieties designated as in the square below.* 


11  12  13
21  22  23
31  32  33


In a 2-replicate experiment these would be arranged in blocks and replicates
as follows, and then the position of the blocks within replicates and
varieties within blocks would be randomized.


Replicate 1   | 11 12 13 | 21 22 23 | 31 32 33 |
Replicate 2   | 11 21 31 | 12 22 32 | 13 23 33 |


The yields from replicate 1 can be represented by the symbol x, those 
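from the second replicate by y, and the varietal totals by t. These can
be assumed set up in 3 squares as shown in Table 13-2 with marginal
totals represented by the capital letters. In the field the blocks are
the rows of the first square and the columns of the second.


TABLE 13-2

$$
\begin{array}{ccc|c}
x_{11} & x_{12} & x_{13} & X_{1\cdot} \\
x_{21} & x_{22} & x_{23} & X_{2\cdot} \\
x_{31} & x_{32} & x_{33} & X_{3\cdot} \\ \hline
X_{\cdot 1} & X_{\cdot 2} & X_{\cdot 3} & X_{\cdot\cdot}
\end{array}
\qquad
\begin{array}{ccc|c}
y_{11} & y_{12} & y_{13} & Y_{1\cdot} \\
y_{21} & y_{22} & y_{23} & Y_{2\cdot} \\
y_{31} & y_{32} & y_{33} & Y_{3\cdot} \\ \hline
Y_{\cdot 1} & Y_{\cdot 2} & Y_{\cdot 3} & Y_{\cdot\cdot}
\end{array}
\qquad
\begin{array}{ccc|c}
t_{11} & t_{12} & t_{13} & T_{1\cdot} \\
t_{21} & t_{22} & t_{23} & T_{2\cdot} \\
t_{31} & t_{32} & t_{33} & T_{3\cdot} \\ \hline
T_{\cdot 1} & T_{\cdot 2} & T_{\cdot 3} & T_{\cdot\cdot}
\end{array}
$$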


* This experiment would not usually have sufficient varieties to warrant using an 


incomplete block design, but an example of this size is sufficient for explanatory 
purposes. 
From the square on the right it can be seen that the 8 degrees of freedom
for the variety totals can be divided into


               DF
Rows            2
Columns         2
Interaction     4
Total           8


which corresponds with a 2-factor experiment in which the rows represent
1 factor and columns the other. Thus if the factors are A and B we
would have 


DF 
A 2 
B 2 
AB 4 
Total 8 


With such an arrangement, in replicate 1 the factor A would be confounded 
with blocks, and in replicate 2 the factor B. This is merely partial con- 
founding as it was outlined in Chapter 12, but here we have confounded 
the main effects and left the interaction unconfounded. Of course this 
makes no difference here as the classification of the variety effects into 
two main effects and their interaction is purely formal and for the purpose 
only of developing a method of analysis. 

If there is a real block effect, the result of confounding is to make 
those components of the varieties represented by A and B too large. 
Thus the A effect is obtained from the totals $T_{1\cdot}$, $T_{2\cdot}$, and $T_{3\cdot}$, and these
are made up as follows.

$$
\begin{aligned}
T_{1\cdot} &= X_{1\cdot} + Y_{1\cdot} \\
T_{2\cdot} &= X_{2\cdot} + Y_{2\cdot} \\
T_{3\cdot} &= X_{3\cdot} + Y_{3\cdot}
\end{aligned}
$$

Since $X_{1\cdot}$, $X_{2\cdot}$, and $X_{3\cdot}$ come from different blocks, they will contain
block effects. Similarly the totals $T_{\cdot 1}$, $T_{\cdot 2}$, and $T_{\cdot 3}$, representing B,
contain block effects from the second replicate. The interaction component
is made up of variation among totals such as

$$
\begin{aligned}
t_{11} + t_{22} + t_{33} &= (x_{11} + x_{22} + x_{33}) + (y_{11} + y_{22} + y_{33}) \\
t_{21} + t_{32} + t_{13} &= (x_{21} + x_{32} + x_{13}) + (y_{21} + y_{32} + y_{13}) \\
t_{31} + t_{12} + t_{23} &= (x_{31} + x_{12} + x_{23}) + (y_{31} + y_{12} + y_{23})
\end{aligned}
$$

these representing 2 out of the total of 4 degrees of freedom. Each total
contains 1 yield from each of the 6 blocks in the experiment; therefore
these totals do not contain block effects.

The next step is to derive estimates of A and B from which the block 
effect has been removed. Thus the mean $T_{1\cdot}/6 = \bar t_{1\cdot}$ is the unadjusted
mean for the first level of A, and $\bar t_{\cdot 1}$ for the first level of B, both of which
require correction for block effect. If we can obtain estimates of these
from which the block effect has been removed, the adjusted varietal means
can be constructed. This follows from the fact that an unadjusted mean
for variety 12, as an example, can be expressed as follows, where $\bar t_{1\cdot}$ and
$\bar t_{\cdot 2}$ are the corresponding row and column means, and m is the general
mean.

$$
v_{12} - m = (\bar t_{1\cdot} - m) + (\bar t_{\cdot 2} - m) + (v_{12} - \bar t_{1\cdot} - \bar t_{\cdot 2} + m) \tag{4}
$$

In other words the deviation of a mean of a variety from the general
mean can be expressed in terms of the deviations of the row and column
means plus an interaction term. The origin of the latter is more obvious
if it is expressed as

$$
(v_{12} - m) - (\bar t_{1\cdot} - m) - (\bar t_{\cdot 2} - m)
$$
which shows that it is the total deviation of the variety mean less the 
deviations due to corresponding row and column means. 

For further analysis of the situation it is simpler to adopt symbols 
similar to those employed by Cochran [1], whose explanation is followed 
very closely in this discussion. An estimate $A_i$ or $B_i$, where i stands for
intra-blocks, can be made which is free from block effect. Thus for the
first level of A, $A_i = Y_{1\cdot}/3$, and for the first level of B, $B_i = X_{\cdot 1}/3$.
Similarly we have estimates $A_b$ and $B_b$ that are confounded with blocks.
Again, at the first level we have $A_b = X_{1\cdot}/3$ and $B_b = Y_{\cdot 1}/3$. The
combined unadjusted effects are represented by $A_0$ and $B_0$. That is,

$$
A_0 = \frac{A_i + A_b}{2}, \qquad B_0 = \frac{B_i + B_b}{2} \tag{5}
$$

and for the first level these would correspond with $\bar t_{1\cdot}$ and $\bar t_{\cdot 1}$.

What we require are adjusted values of A and B. These can be obtained
if suitable error variances for the intra-block and inter-block effects can
be found. Assuming that these error variances on a single-plot basis are
$\sigma^2$ and $\sigma'^2$, then good estimates of A and B arise from the simple procedure
of using the reciprocals of $\sigma^2$ and $\sigma'^2$ as weights. Thus, if $w = 1/\sigma^2$ and
$w' = 1/\sigma'^2$, the weighted estimates of A and B are

$$
A = \frac{wA_i + w'A_b}{w + w'}, \qquad B = \frac{wB_i + w'B_b}{w + w'} \tag{6}
$$
This gives us a basis for estimating A and B from all the information
available. Historically this point is very interesting, because when the
lattice designs were first presented by Yates [17] he was not certain that
a sufficiently reliable estimate of $\sigma'^2$ could be obtained, and he discarded
$A_b$ and $B_b$ entirely, taking $A_i$ and $B_i$ only for the estimates of A and B.
Now when $\sigma'^2$ is equal to $\sigma^2$, $A_i$ and $A_b$ are equally important and therefore
$A_b$ cannot be discarded without loss of information. This condition is
obtained when the blocks are of no value in error control, that is, when
the variation between blocks is no larger than within blocks. To discard
$A_b$ and $B_b$ under these conditions is to make the experiment less efficient
than if it had been laid out in randomized blocks. Yates [19, 20] then
developed the method outlined above wherein due weight is given to the two
components of A and B. The result is completely satisfying because, if
$\sigma'^2 = \sigma^2$, then there is no weighting and A becomes the simple arithmetic
mean of $A_i$ and $A_b$, and there are no corrections to be applied to the
variety means. Yates [19] also showed that a lattice experiment could
be analyzed as if it had been laid out in randomized blocks and there 
would be no bias in the tests of significance. This means that, when the 
results of the lattice experiment indicate that error control by means of 
the blocks is not effective, the logical procedure is to discontinue the 
analysis as a lattice design and continue as for randomized blocks. Thus 
the lowest efficiency that can be reached in a lattice design is that of 
randomized blocks.* 

Having obtained estimates of A and B, we can follow equation (4),
writing it first in terms of the symbols just described. This gives us

$$
v_0 - m = (A_0 - m) + (B_0 - m) + (v_0 - A_0 - B_0 + m) \tag{7}
$$

where $v_0$ is an unadjusted variety mean. From the adjusted values of
A and B we will have

$$
v - m = (A - m) + (B - m) + (v - A - B + m) \tag{8}
$$

and, on subtracting the first from the second (the interaction terms, being
free of block effects, are equal and cancel),

$$
v - v_0 = (A - A_0) + (B - B_0)
$$

or

$$
v = v_0 + (A - A_0) + (B - B_0) \tag{9}
$$


* There is no bias in the F test when results from a lattice experiment are analyzed 
as if coming from a randomized block experiment. This does not apply strictly to 
t tests because the accuracy of all comparisons is not the same, but the variation in
accuracy is small. Also the lowest efficiency of a lattice design may actually be slightly 
lower than for randomized blocks if we take into consideration the error in estimating 
the weights that are used in deciding between the lattice analysis and the randomized 
block analysis. 
and this can be applied in making corrections to variety means. However,
it is more convenient in practice to put it into a different form. In
the first place, with some manipulation, we can show that

$$
A - A_0 = \frac{w - w'}{w + w'}\,(A_0 - A_b), \qquad
B - B_0 = \frac{w - w'}{w + w'}\,(B_0 - B_b)
$$
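The manipulation can be sketched as follows. The symbols are assumed to be those of the preceding paragraphs: $A_i$ and $A_b$ are the intra-block and confounded estimates, $A_0 = (A_i + A_b)/2$ is the unadjusted value, $A = (wA_i + w'A_b)/(w + w')$ is the weighted value, and $w$, $w'$ are the two weights.

$$
\begin{aligned}
A - A_0 &= \frac{wA_i + w'A_b}{w + w'} - \frac{A_i + A_b}{2}
         = \frac{(w - w')(A_i - A_b)}{2(w + w')}, \\
A_0 - A_b &= \frac{A_i + A_b}{2} - A_b = \frac{A_i - A_b}{2}.
\end{aligned}
$$

Dividing the first line by the second gives $A - A_0 = \dfrac{w - w'}{w + w'}\,(A_0 - A_b)$, and the same steps with B in place of A give the second identity. When $w = w'$ both corrections vanish, as expected.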


For convenience we put $\lambda = (w - w')/(w + w')$; then equation (9) is

$$
v = v_0 + \lambda(A_0 - A_b) + \lambda(B_0 - B_b) \tag{10}
$$


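In order to work out the actual correction to a variety mean, say to
variety 11, we have to obtain $A_0$, $A_b$, $B_0$, and $B_b$ for the first level of A
and B. Thus

$$
A_0 = \frac{T_{1\cdot}}{6}, \qquad A_b = \frac{X_{1\cdot}}{3}
$$

Then, since $T_{1\cdot} = X_{1\cdot} + Y_{1\cdot}$,

$$
A_0 - A_b = \frac{T_{1\cdot}}{6} - \frac{X_{1\cdot}}{3} = \frac{Y_{1\cdot} - X_{1\cdot}}{6}
$$

Similarly, with $B_0 = T_{\cdot 1}/6$ and $B_b = Y_{\cdot 1}/3$,

$$
B_0 - B_b = \frac{T_{\cdot 1}}{6} - \frac{Y_{\cdot 1}}{3} = \frac{X_{\cdot 1} - Y_{\cdot 1}}{6}
$$

Representing these as $c'_{1\cdot}$ and $c'_{\cdot 1}$, respectively, the complete expression
for a corrected variety mean is

$$
v_{ef} = v_0 + \lambda c'_{e\cdot} + \lambda c'_{\cdot f} \tag{11}
$$

where ef is the 2-digit number representing a variety.
In general when the simple lattice design is for $p^2$ varieties and there
are 2 repetitions of each group, making a total of 4 replications, we have

$$
c'_{e\cdot} = \frac{T_{e\cdot} - 2X_{e\cdot}}{4p}, \qquad
c'_{\cdot f} = \frac{T_{\cdot f} - 2Y_{\cdot f}}{4p} \tag{12}
$$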
There will be a value of $c'_{e\cdot}$ for each row of the table of variety means
and of $c'_{\cdot f}$ for each column. These and the values of $c_{e\cdot} = \lambda c'_{e\cdot}$ and
$c_{\cdot f} = \lambda c'_{\cdot f}$ can be worked out first and entered in the margins of the table,
and then applied to the unadjusted variety means by addition.
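The adjustment can be illustrated for the 2-replicate 9-variety case with a short Python sketch. Everything numerical here is hypothetical: the yields and the weights are invented for illustration, and the function name is not from the text. The row and column quantities are taken as $A_0 - A_b = (T_{e\cdot} - 2X_{e\cdot})/(2p)$ and $B_0 - B_b = (T_{\cdot f} - 2Y_{\cdot f})/(2p)$, which is the form these differences take when each group occurs once.

```python
# Hypothetical yields: x[e][f] from replicate 1 (blocks are rows),
# y[e][f] from replicate 2 (blocks are columns).
x = [[20.0, 22.0, 19.0], [25.0, 27.0, 24.0], [18.0, 21.0, 17.0]]
y = [[21.0, 24.0, 20.0], [23.0, 26.0, 22.0], [19.0, 25.0, 18.0]]


def adjusted_means(x, y, w, w_prime):
    """Variety means adjusted for block effects, given weights w and w'."""
    p = len(x)
    lam = (w - w_prime) / (w + w_prime)
    X_row = [sum(row) for row in x]
    Y_row = [sum(row) for row in y]
    X_col = [sum(col) for col in zip(*x)]
    Y_col = [sum(col) for col in zip(*y)]
    # Row margin: A_0 - A_b; column margin: B_0 - B_b (T = X + Y in each case).
    c_row = [(X_row[e] + Y_row[e] - 2 * X_row[e]) / (2 * p) for e in range(p)]
    c_col = [(X_col[f] + Y_col[f] - 2 * Y_col[f]) / (2 * p) for f in range(p)]
    return [[(x[e][f] + y[e][f]) / 2 + lam * c_row[e] + lam * c_col[f]
             for f in range(p)] for e in range(p)]


unadjusted = [[(x[e][f] + y[e][f]) / 2 for f in range(3)] for e in range(3)]
adjusted = adjusted_means(x, y, w=0.5, w_prime=0.1)   # hypothetical weights
no_weighting = adjusted_means(x, y, w=0.5, w_prime=0.5)

for row in adjusted:
    print([round(v, 2) for v in row])
```

With equal weights the means are returned unchanged, and for any weights the adjustments sum to zero over the whole table, so the general mean is not altered.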

Our explanation up to this point depends on being able to make reliable
estimates of $\sigma^2$ and $\sigma'^2$ in order to obtain the required weights, $w = 1/\sigma^2$
and $w' = 1/\sigma'^2$. The former will present no difficulty. It is comparable
to intra-block error in a confounded factorial experiment and is therefore
the mean square $E_i$ for intra-block error in the analysis of variance. It
can be obtained in the analysis by methods similar to those employed in
Chapter 10 dealing with the analysis of non-orthogonal data. Thus either
of the analyses given for the 9-variety experiment with 2 replications yields
the intra-block error mean square. The second method is the one that
we shall apply because it yields values of $E_i$ and $E_b$ from which an estimate
of $\sigma'^2$ can be obtained. The first method would obviously fail in this
respect because $E'_b$ contains variety effects.

                                              DF    MS

Method 1    Replicates                         1
            Blocks (ignoring varieties)        4    $E'_b$
            Varieties (eliminating blocks)     8
            Intra-block error                  4    $E_i$
            Total                             17

                                              DF    MS

Method 2    Replicates                         1
            Blocks (eliminating varieties)     4    $E_b$
            Varieties (ignoring blocks)        8
            Intra-block error                  4    $E_i$
            Total                             17


In accordance with the principles of the analysis of variance, for an
experiment with $p^2$ varieties with p units in each block the estimate of
block effect that we require is $\sigma'^2 = \sigma^2 + p\sigma_b^2$, where $\sigma_b^2$ represents the
effect of blocks alone. That is, if the variation due to blocks arises only
from plot variation, $\sigma_b^2$ will be equal to zero. If the blocks containing
the same sets of varieties are repeated as in a 4- or 6-replicate experiment,
a direct estimate of $\sigma_b^2$ can be made from the differences between such
blocks. In the 9-variety experiment such blocks as | 11 12 13 | will
occur twice and differences between them freed of replicate effect will be
due entirely to blocks. This component of block effect, commonly
referred to as component (a), would correspond to only 4 out of the
8 degrees of freedom for blocks in a 4-replicate experiment. The other 4
represent what is commonly referred to as component (b), the value of
which arises from the fact that it can be estimated without repeating the
blocks. Thus for the 9-variety experiment with only 2 replicates we have
differences between such totals as $X_{1\cdot}$ and $Y_{1\cdot}$ which contain the same set
of varieties. When freed of replicate effects they must represent blocks.

The best way to understand clearly how components (a) and (b) are
obtained and what they represent is to visualize a 4-replicate simple lattice
experiment for which the data are set up as in Table 13-3.


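TABLE 13-3


DIAGRAMMATIC REPRESENTATION OF DATA FROM A SIMPLE
LATTICE EXPERIMENT


                Group A                                     Group B
   Replicate 1         Replicate 2            Replicate 3            Replicate 4
 11 12 13 | g_11     11 12 13 | g_12        11   12   13           11   12   13
 21 22 23 | g_21     21 22 23 | g_22        21   22   23           21   22   23
 31 32 33 | g_31     31 32 33 | g_32        31   32   33           31   32   33
            G_1                 G_2        g_13 g_23 g_33 | G_3   g_14 g_24 g_34 | G_4

 Sums of replicates 1 and 2                 Sums of replicates 3 and 4
 a_11 a_12 a_13 | X_1.                      b_11 b_12 b_13 | Y_1.
 a_21 a_22 a_23 | X_2.                      b_21 b_22 b_23 | Y_2.
 a_31 a_32 a_33 | X_3.                      b_31 b_32 b_33 | Y_3.
 X_.1 X_.2 X_.3 | X_..                      Y_.1 Y_.2 Y_.3 | Y_..

 Variety totals
 t_11 t_12 t_13 | T_1.
 t_21 t_22 t_23 | T_2.
 t_31 t_32 t_33 | T_3.
 T_.1 T_.2 T_.3 | T_..

Component (a) can be calculated in 2 parts as follows.

$$
\begin{aligned}
&\frac{(g_{11} - g_{12})^2 + (g_{21} - g_{22})^2 + (g_{31} - g_{32})^2}{2 \times 3} - \frac{(G_1 - G_2)^2}{2 \times 9} \\
&\frac{(g_{13} - g_{14})^2 + (g_{23} - g_{24})^2 + (g_{33} - g_{34})^2}{2 \times 3} - \frac{(G_3 - G_4)^2}{2 \times 9}
\end{aligned} \tag{13}
$$

Component (b) can also be obtained in 2 parts.

$$
\begin{aligned}
&\frac{(X_{1\cdot} - Y_{1\cdot})^2 + (X_{2\cdot} - Y_{2\cdot})^2 + (X_{3\cdot} - Y_{3\cdot})^2}{4 \times 3} - \frac{(X_{\cdot\cdot} - Y_{\cdot\cdot})^2}{4 \times 9} \\
&\frac{(X_{\cdot 1} - Y_{\cdot 1})^2 + (X_{\cdot 2} - Y_{\cdot 2})^2 + (X_{\cdot 3} - Y_{\cdot 3})^2}{4 \times 3} - \frac{(X_{\cdot\cdot} - Y_{\cdot\cdot})^2}{4 \times 9}
\end{aligned} \tag{14}
$$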
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A formula for component (5) that is generally more convenient arises 
from such equalities as 


Te. 0% = OG, Yo 2X) E ОА; 


The complete expression is then 


(Ty 2X)? + (Ty — 2X, + (Ta 2X)? + (Ta 2 Ya)? + (Tig 2 Ye)?  (74-2Y.9* 
4x3 


15 
_ 2X. = Py! 


4x9 
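The agreement between the two-part form of component (b) and the more convenient form can be checked numerically. The sketch below is only an illustration on made-up data: the arrays `a` and `b` (group A and group B totals per variety, as laid out in Table 13-3) are random numbers, and the function names are hypothetical. It assumes that group A blocks are the rows and group B blocks are the columns of the variety square.

```python
import random

def component_b_two_parts(a, b, r=4):
    """Component (b) from row and column differences between groups A and B."""
    p = len(a)
    x_row = [sum(a[i]) for i in range(p)]
    y_row = [sum(b[i]) for i in range(p)]
    x_col = [sum(a[i][j] for i in range(p)) for j in range(p)]
    y_col = [sum(b[i][j] for i in range(p)) for j in range(p)]
    corr = (sum(x_row) - sum(y_row)) ** 2 / (r * p * p)
    part1 = sum((x - y) ** 2 for x, y in zip(x_row, y_row)) / (r * p) - corr
    part2 = sum((x - y) ** 2 for x, y in zip(x_col, y_col)) / (r * p) - corr
    return part1 + part2

def component_b_from_totals(a, b, r=4):
    """Component (b) from T - 2X (rows) and T - 2Y (columns), as a sum of W squared."""
    p = len(a)
    t = [[a[i][j] + b[i][j] for j in range(p)] for i in range(p)]
    t_row = [sum(t[i]) for i in range(p)]
    t_col = [sum(t[i][j] for i in range(p)) for j in range(p)]
    x_row = [sum(a[i]) for i in range(p)]                       # group A block totals
    y_col = [sum(b[i][j] for i in range(p)) for j in range(p)]  # group B block totals
    w = [[(t_row[i] - 2 * x_row[i]) + (t_col[j] - 2 * y_col[j])
          for j in range(p)] for i in range(p)]
    return sum(v * v for row in w for v in row) / (r * p * p)

random.seed(0)
p = 3
a = [[random.uniform(40, 60) for _ in range(p)] for _ in range(p)]  # made-up data
b = [[random.uniform(40, 60) for _ in range(p)] for _ in range(p)]  # made-up data
two_parts = component_b_two_parts(a, b)
from_totals = component_b_from_totals(a, b)
print(two_parts, from_totals)
```

The second function also shows why the sum of squares of the quantities W, divided by r p², gives the same result for a simple lattice.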


An interesting and important fact with respect to components (a) and
(b) is explained in detail by Yates [17, 18] and Cochran [1]. The expectation
of the mean square for component (a) is $\sigma^2 + p\sigma_b^2$, but that of
component (b) is $\sigma^2 + \tfrac{1}{2}p\sigma_b^2$. The sense of this arises from the fact
that the 2 parts of component (b) come from the same data arranged in 2
different ways and are therefore not completely independent. It means
that, if an estimate of $\sigma^2 + p\sigma_b^2$ is obtained from component (b), some
allowance must be made for the different expectation. Thus $E_b$, the
mean square for component (b), can be equated as follows, where $E_e$ is
the mean square for intra-block error:

$$E_b = E_e + \tfrac{1}{2}p\sigma_b^2$$
$$2E_b = 2E_e + p\sigma_b^2$$
$$2E_b - E_e = E_e + p\sigma_b^2$$

Therefore

$$w' = \frac{1}{2E_b - E_e} \tag{16}$$

The combined mean square for components (a) and (b) has an expectation
of $\sigma^2 + \tfrac{3}{4}p\sigma_b^2$; therefore

$$E_b = E_e + \tfrac{3}{4}p\sigma_b^2$$
$$4E_b = 4E_e + 3p\sigma_b^2$$
$$\frac{4E_b - E_e}{3} = E_e + p\sigma_b^2$$

and

$$w' = \frac{3}{4E_b - E_e} \tag{17}$$

It should be pointed out in this connection that the method of weighting
employed here is approximate and not exact, and consequently there is a
slight loss of efficiency. The exact method is rather complex and makes
so little difference that it is not worth while.

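Finally we must have appropriate variances for the differences between
variety means. These are given by Cochran [1] in the following form.

$V_1$ (varieties occurring together in the same block)

$$V_1 = \frac{2}{rwp}\left[(p-1) + \frac{2w}{w+w'}\right] \tag{18}$$

$V_2$ (varieties that do not occur in the same block)

$$V_2 = \frac{2}{rwp}\left[(p-2) + \frac{4w}{w+w'}\right] \tag{19}$$

$V_m$ (mean variance for all pairs of varieties)

$$V_m = \frac{2}{rw(p+1)}\left[(p-1) + \frac{4w}{w+w'}\right] \tag{20}$$

If we take $\lambda = (w-w')/(w+w')$, the above formulas can be expressed
in the simpler and more fundamental form,

$$V_1 = \frac{2E_e}{r}\left(1 + \frac{\lambda}{p}\right) \tag{21}$$

$$V_2 = \frac{2E_e}{r}\left[1 + 2\left(\frac{\lambda}{p}\right)\right] \tag{22}$$

$$V_m = \frac{2E_e}{r}\left(1 + \frac{2\lambda}{p+1}\right) \tag{23}$$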


These formulas are very instructive with respect to the effect of the
block variance. Assuming for simplicity that $w'$ is estimated from
component (a) only, we have $w' = 1/E_a$. Then $\lambda = (E_a - E_e)/(E_a + E_e)$.
When $E_a < E_e$, $\lambda$ is taken as 0 and all the formulas reduce to $2E_e/r$, which is the
same as for randomized blocks. As $E_a$ increases without limit, it is easy
to show that $\lambda$ approaches 1. The three formulas then become

$$V_1 = \frac{2E_e}{r}\left(\frac{p+1}{p}\right) \tag{24}$$

$$V_2 = \frac{2E_e}{r}\left(\frac{p+2}{p}\right) \tag{25}$$

$$V_m = \frac{2E_e}{r}\left(\frac{p+3}{p+1}\right) \tag{26}$$

which are the formulas appropriate for the type of analysis where the A
and В effects are estimated from 4; and В; only, or in other words when 
no information is recovered, with respect to A and В, from A, and В,. 
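The behaviour of λ and of the three variances can be seen in a small numerical sketch. It assumes, as in the text, that w' is estimated from component (a) only, so that w' = 1/E_a; the function name and the mean squares used are illustrative values, not data from the book.

```python
def simple_lattice_variances(E_a, E_e, r, p):
    """V1, V2 and Vm for a simple lattice, assuming w' = 1/E_a and w = 1/E_e."""
    w, w_prime = 1.0 / E_e, 1.0 / E_a
    lam = max(0.0, (w - w_prime) / (w + w_prime))  # taken as 0 when E_a < E_e
    base = 2.0 * E_e / r
    return {
        "lambda": lam,
        "V1": base * (1 + lam / p),
        "V2": base * (1 + 2 * lam / p),
        "Vm": base * (1 + 2 * lam / (p + 1)),
    }

def cochran_form(E_a, E_e, r, p):
    """The same variances written directly in terms of w and w'."""
    w, w_prime = 1.0 / E_e, 1.0 / E_a
    ratio = w / (w + w_prime)
    return {
        "V1": 2 / (r * w * p) * ((p - 1) + 2 * ratio),
        "V2": 2 / (r * w * p) * ((p - 2) + 4 * ratio),
        "Vm": 2 / (r * w * (p + 1)) * ((p - 1) + 4 * ratio),
    }

r, p, E_e = 4, 5, 20.0                                  # illustrative values only
no_block_effect = simple_lattice_variances(20.0, E_e, r, p)
moderate = simple_lattice_variances(60.0, E_e, r, p)
extreme = simple_lattice_variances(1e12, E_e, r, p)
direct = cochran_form(60.0, E_e, r, p)
print(no_block_effect, moderate, extreme)
```

With E_a equal to E_e all three variances equal 2E_e/r, and with a very large E_a they approach the limiting forms in which λ = 1.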

If we require an F test of the significance of the variety variance, it
should be clear that an adjustment must be made in the variety variance
calculated directly from the variety totals, as this will contain block effect.
For the simple lattice Cochran [1] gives this adjustment as

$$-\lambda\left[\left(1 + \frac{w'}{w}\right)S_u - S_a\right] \tag{27}$$

where $S_u$ and $S_a$ are, respectively, the unadjusted and adjusted sums of
squares for component (b). $S_a$ is calculated as explained above for
component (b). Referring to Table 13-3, $S_u$ will be

$$\frac{X_{1\cdot}^2 + X_{2\cdot}^2 + X_{3\cdot}^2}{6} - \frac{X_{\cdot\cdot}^2}{18} + \frac{Y_{\cdot 1}^2 + Y_{\cdot 2}^2 + Y_{\cdot 3}^2}{6} - \frac{Y_{\cdot\cdot}^2}{18}$$

The adjustment is shown with a negative sign, indicating that it must
always be subtracted from the unadjusted variety sum of squares. After
this adjustment the F test can be made in the usual manner.

9. Lattice designs with from 2 to p groups. As pointed out in Section 6,
triple, quadruple, and other partially balanced lattice designs can be made
up by writing out additional orthogonal groups. The only restriction on
designs of practical significance (those for p = 3 to 11) is that for p = 6
and 10 we cannot go higher than the triple lattice. If we write p + 1
groups, the design becomes a balanced lattice and has special properties
that justify our placing it in a separate class.

There is little difficulty in the analysis of designs of higher order than
the simple lattice. Let k = the number of groups, and the formulas
given above can be restated for the general case. These are given below
and a convenient analytical procedure outlined.

A valuable preliminary step is to perform an analysis of variance as if
the experiment had been laid out in randomized blocks, as all the sums
of squares determined will be required in the complete analysis. An F
test of the significance of the variety mean square can be made, and if the
F is significant further tests will not be necessary, as in the complete
analysis the significance may be increased but cannot be decreased. The
outline of the analysis will be as follows.

                     DF
   Replications      r - 1
   Varieties         p² - 1
   Error             (p² - 1)(r - 1)
   Total             rp² - 1                                    (28)

Assuming that the variety differences are significant, we proceed to the
detailed analysis wherein the first major objective is to obtain block sums
of squares from which the effect of varieties has been eliminated. There
will be 2 forms of the analysis, depending on whether the groups are or
are not repeated. These are given in outline form below.

   Groups Not Repeated (r = k)                 Groups Repeated (r = 2k)
                       DF                                          DF
   Replicates          r - 1                   Replicates          r - 1
   Component (b)       r(p - 1)                Component (a)       k(p - 1) }
                                               Component (b)       k(p - 1) }  r(p - 1)
   Varieties           p² - 1                  Varieties           p² - 1
   Intra-block error   (p - 1)(rp - p - 1)     Intra-block error   (p - 1)(rp - p - 1)
   Total               rp² - 1                 Total               rp² - 1

Note that we could have r = 3k, r = 4k, etc., but such designs are not
of much practical importance since the main objective will be to increase
the number of replications by increasing the number of groups rather
than the number of repetitions. Thus for a 6-replicate experiment we
have the choice 6 = 3 × 2, or 6 = 2 × 3, where the last figure is k.
The latter design is to be preferred, so it is hardly necessary to consider
the other.

For any variety such as 12, let $b_{12}$ represent the total of the block in
which it occurs, or, when k > 2, the total of all similar blocks. Referring
to Table 13-3 it is noted that there are 2 values of $b_{12}$: $X_{1\cdot}$ in group A
and $Y_{\cdot 2}$ in group B. Then let $u_{12}$ represent a total of all varieties occurring
in the same block with 12. Again from Table 13-3, one value of $u_{12}$ is
$T_{1\cdot}$, and the other one is $T_{\cdot 2}$. Then for variety 12 the quantity W is

$$W_{12} = (T_{1\cdot} - 2X_{1\cdot}) + (T_{\cdot 2} - 2Y_{\cdot 2})$$

or in general, where ef represents the variety number,

$$W_{ef} = \sum^{k}(u_{ef} - kb_{ef}) \tag{29}$$

and the $p^2$ quantities so determined yield the sum of squares for component
(b) and the corrections to the variety means. Component (b) sum
of squares is simply

$$\frac{\sum (W_{ef}^2)}{r(k-1)p^2} \tag{30}$$

Component (a) will be calculated by obtaining the sum of squares for
differences between similar blocks. We can then set up the analysis and
determine the mean squares $E_b$ and $E_e$.

The next step is to determine the values of $w$ and $w'$. The value of $w$
is always given by $1/E_e$, but the method of calculating $w'$ depends on the
number of degrees of freedom available for estimating component (a).
If n represents the number of repetitions of the groups, a table can be set
up as follows, giving the degrees of freedom available for estimating
component (a). Each quantity in the table is to be multiplied by (p - 1).


                        k
   n      2      3      4      5      6    etc.
   1      0      0      0      0      0
   2      2      3     (4)    (5)    (6)
   3     (4)    (6)     8     10     12
   4      6      9     12     15     18

The general formula is k(n - 1)(p - 1). Since a lattice experiment will
rarely if ever be used for $p^2$ less than 25, the numbers circled and connected
by a dotted line (shown here in parentheses) indicate those for which there
are sufficient degrees of freedom to give a fairly reliable estimate of the
block mean square from component (a) alone. The same will apply of course
to those cases below the circled ones. In the first row we have component (b) only,
so that there are actually only 2 situations in which components (a) and
(b) are both required in order to obtain an estimate of the mean square
for blocks. For these we have

$$w' = \frac{r-1}{rE_b - E_e} \tag{31}$$

This is actually a general formula because when we have component (b)
alone it is also correct for the calculation of $w'$. When component (a)
alone is sufficient, we have

$$w' = \frac{1}{E_a} \tag{32}$$

The general formula for $\lambda$ is

$$\lambda = \frac{(k-1)(w-w')}{(k-1)w + w'} \tag{33}$$

Then a corrected variety total is given by

$$rv_{ef} = t_{ef} + \frac{\lambda}{(k-1)p}\,W_{ef} \tag{34}$$

and the variances for differences between variety means are

$$V_1 = \frac{2E_e}{r}\left(1 + \frac{\lambda}{p}\right) \tag{35}$$

$$V_2 = \frac{2E_e}{r}\left[1 + \frac{k\lambda}{(k-1)p}\right] \tag{36}$$

$$V_m = \frac{2E_e}{r}\left[1 + \frac{k\lambda}{(k-1)(p+1)}\right] \tag{37}$$

Finally, the adjustment to the variety sum of squares in order to make
an F test is

$$-\lambda\left[\left(1 + \frac{w'}{(k-1)w}\right)S_u - S_a\right] \tag{38}$$
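The general formulas of this section can be collected into a few small functions. This is a sketch only: it assumes formulas (31), (33), and (35) to (37) in the forms w' = (r - 1)/(rE_b - E_e), λ = (k - 1)(w - w')/[(k - 1)w + w'], and the three variances with multipliers λ/p, kλ/[(k - 1)p], and kλ/[(k - 1)(p + 1)]. The function names and the mean squares used in the check are illustrative, not taken from the book.

```python
def block_weights(E_b, E_e, r):
    """w = 1/E_e and w' from the general formula (31)."""
    return 1.0 / E_e, (r - 1) / (r * E_b - E_e)

def lattice_lambda(w, w_prime, k):
    """General lambda for a lattice with k groups, formula (33)."""
    return (k - 1) * (w - w_prime) / ((k - 1) * w + w_prime)

def lattice_variances(E_e, lam, r, k, p):
    """Variances of a difference between two variety means, (35) to (37)."""
    base = 2.0 * E_e / r
    v1 = base * (1 + lam / p)                          # same block
    v2 = base * (1 + k * lam / ((k - 1) * p))          # never in the same block
    vm = base * (1 + k * lam / ((k - 1) * (p + 1)))    # mean over all pairs
    return v1, v2, vm

# Illustrative mean squares for a triple lattice (k = 3) that is not repeated.
E_b, E_e, r, k, p = 50.0, 20.0, 3, 3, 5
w, w_prime = block_weights(E_b, E_e, r)
lam = lattice_lambda(w, w_prime, k)
v1, v2, vm = lattice_variances(E_e, lam, r, k, p)
print(lam, v1, v2, vm)
```

Two checks are worth making on any such implementation. When r = k, λ should equal (E_b - E_e)/E_b. And since each variety shares a block with k(p - 1) of the other p² - 1 varieties, V_m should be the corresponding weighted average of V_1 and V_2.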


10. Analysis of a balanced lattice. As already noted, a balanced lattice
arises from the use of p + 1 orthogonal groups. The method of analysis
is essentially the same as for the partially balanced lattice but with certain
simplifications.

In the first place it is easier to calculate the values of $W_{ef}$. The reason
for this can be seen from examining the plan for a balanced lattice with
9 varieties as given below.

   11 12 13     11 21 31     11 22 33     11 32 23
   21 22 23     12 22 32     21 32 13     21 12 33
   31 32 33     13 23 33     31 12 23     31 22 13

For any one variety such as 11 we can write the value of the u for each
block as follows.

$$(t_{11} + t_{12} + t_{13}) + (t_{11} + t_{21} + t_{31}) + (t_{11} + t_{22} + t_{33}) + (t_{11} + t_{32} + t_{23})$$

which is obviously equal to $3t_{11} + G$, where G is the grand total for the
whole experiment. We have then

$$W_{11} = 3t_{11} + G - 4\sum b_{11}$$

or in general

$$W_{ef} = pt_{ef} + G - r\sum(b_{ef}) \tag{39}$$

The formula for the sum of squares for component (b) is

$$\frac{\sum (W_{ef}^2)}{r(k-1)p^2}$$

as in the case of the partially balanced lattice. Also a corrected variety
total is

$$rv_{ef} = t_{ef} + \frac{\lambda}{(k-1)p}\,W_{ef}$$

where

$$\lambda = \frac{(k-1)(w-w')}{(k-1)w + w'}$$

and

$$w' = \frac{r-1}{rE_b - E_e}$$

There is only one variance for differences between means as all comparisons
are made with equal precision. This, as would be expected, is

$$\frac{2E_e}{r}\left(1 + \frac{\lambda}{p}\right) \tag{42}$$

showing that in the balanced lattice all comparisons are of the same
precision as the most precise comparison in a partially balanced lattice.

Finally the sum of squares for varieties, adjusted so that it can be
compared with the intra-block error, comes from the sum of squares of
the corrected variety totals, divided by the factor $(1 + \lambda/p)$. It may be
expressed therefore as

$$\frac{\dfrac{\sum (rv_{ef})^2}{r} - \dfrac{G^2}{rp^2}}{1 + \dfrac{\lambda}{p}} \tag{43}$$


11. Example 13-1. Partially balanced lattice. Table 13-4 gives the
yields in bushels per acre for a 4-replicate quadruple lattice experiment
with 25 varieties of wheat. Variety totals are given in Table 13-5. The
preliminary analysis of variance is given in Table 13-6.

The next step to obtain the complete analysis is to find the sum of
squares for blocks from which variety effect has been eliminated. To do
this we require the quantities $W_{ef}$ calculated from formula (29), from the
values of $u_{ef}$ and $b_{ef}$ as given in Table 13-4. For example we have

$$W_{23} = -173.9 - 153.2 + 262.4 + 63.0 = -1.7$$

These have been calculated and set up in Table 13-7. From formula (30)
the sum of squares for blocks is

$$\frac{282{,}806.58}{300} = 942.689$$

TABLE 13-4

PLOT YIELDS, BLOCK TOTALS, AND VALUES OF u AND u - kb, FOR
4-REPLICATE QUADRUPLE LATTICE

             Replicate 1        Replicate 2        Replicate 3        Replicate 4
   Block     Variety Group A    Variety Group B    Variety Group C    Variety Group D
               11     27.4        11     37.2        11     23.4        11     20.0
               12     38.6        21     37.5        22     19.0        32     20.8
     1         13     29.1        31     18.0        33     11.8        53     27.6
               14     21.3        41     42.2        44     13.3        24     23.9
               15     34.1        51     43.3        55     18.9        45     25.5
     b               150.5              178.2               86.4              117.8
     u               575.5              578.8              584.4              576.8
     u - kb          -26.5             -134.0              238.8              105.6

               21     36.6        12     40.9        21     22.5        21     23.8
               22     46.1        22     44.2        32     22.7        42     38.5
     2         23     44.7        32     36.7        43     25.0        13     25.4
               24     40.6        42     34.8        54     22.3        34     30.1
               25     37.3        52     35.8        15     34.0        55     29.8
     b               205.3              192.4              126.5              147.6
     u               647.3              616.3              628.5              597.6
     u - kb         -173.9             -153.3              122.5                7.2

               31     16.7        13     31.6        31     18.0        31     21.0
               32     30.0        23     45.0        42     19.3        52     21.5
     3         33     28.0        33     37.8        53     25.0        23     34.5
               34     33.3        43     40.8        14     19.9        44     21.0
               35     26.2        53     36.4        25     28.3        15     32.5
     b               134.2              191.6              110.5              130.5
     u               516.6              613.2              536.4              585.0
     u - kb          -20.2             -153.2               94.4               63.0

               41     40.1        14     24.4        41     22.7        41     39.2
               42     25.6        24     28.0        52     25.0        12     39.9
     4         43     39.7        34     38.6        13     22.4        33     28.4
               44     36.4        44     38.8        24     25.4        54     28.2
               45     37.3        54     36.0        35     24.4        25     32.7
     b               179.1              165.8              119.9              168.4
     u               622.5              562.0              594.5              640.7
     u - kb          -93.9             -101.2              114.9              -32.9

               51     33.4        15     39.3        51     27.8        51     28.0
               52     38.8        25     35.6        12     13.1        22     25.0
     5         53     35.0        35     31.2        23     16.6        43     28.4
               54     37.6        45     37.3        34     21.9        14     21.0
               55     38.4        55     39.5        45     16.6        35     21.0
     b               183.2              182.9               96.0              123.4
     u               628.3              619.9              646.4              590.1
     u - kb         -104.5             -111.7              262.4               96.5

   Replicate
   Total             852.3              910.9              539.3              687.7
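The values of W and the block sum of squares for this example can be recomputed from the u - kb rows of Table 13-4. The sketch below is illustrative only. The two index rules used to find the group C and group D block of a variety are an assumption chosen to reproduce the blocks listed in the table (they are not stated in the text), and the variable names are hypothetical.

```python
# u - kb for blocks 1 to 5 of each group, read from Table 13-4 (k = 4).
A = [-26.5, -173.9, -20.2, -93.9, -104.5]      # group A: block = first digit e
B = [-134.0, -153.3, -153.2, -101.2, -111.7]   # group B: block = second digit f
C = [238.8, 122.5, 94.4, 114.9, 262.4]         # group C
D = [105.6, 7.2, 63.0, -32.9, 96.5]            # group D
p, r, k = 5, 4, 4

def w_value(e, f):
    """W for variety ef: the sum of u - kb over the four blocks containing it."""
    c_block = (e - f) % p            # assumed rule matching the group C blocks
    d_block = (e - 2 * f + 1) % p    # assumed rule matching the group D blocks
    return A[e - 1] + B[f - 1] + C[c_block] + D[d_block]

W = {(e, f): w_value(e, f) for e in range(1, p + 1) for f in range(1, p + 1)}

# Formula (30): component (b) sum of squares.
ss_blocks = sum(v * v for v in W.values()) / (r * (k - 1) * p ** 2)
print(round(W[(2, 3)], 1), round(ss_blocks, 3))
```

The sum of all 25 values of W should be zero, which gives a convenient check before the sum of squares is formed.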
TABLE 13-5

VARIETY TOTALS, 4-REPLICATE QUADRUPLE LATTICE

                                                               Row Total
                    11       12       13       14       15
                   108.0    132.5    108.5     86.6    139.9      575.5
                    21       22       23       24       25
                   120.4    134.3    140.8    117.9    133.9      647.3
                    31       32       33       34       35
                    73.7    110.2    106.0    123.9    102.8      516.6
                    41       42       43       44       45
                   144.2    118.2    133.9    109.5    116.7      622.5
                    51       52       53       54       55
                   132.5    121.1    124.0    124.1    126.6      628.3
   Column Total    578.8    616.3    613.2    562.0    619.9     2990.2 = G

TABLE 13-6

PRELIMINARY ANALYSIS OF VARIANCE FOR QUADRUPLE LATTICE

                  SS          DF      MS         F       5% Point
   Replicates     3384.235     3
   Varieties      1664.745    24      69.3644    2.59    1.67
   Error          1927.960    72      26.7772
   Total          6976.940    99
TABLE 13-7

VALUES OF W_ef, CORRECTED VARIETY TOTALS (rv_ef) AND CORRECTED
VARIETY MEANS (v_ef). QUADRUPLE LATTICE IN 4 REPLICATIONS

   Variety     W_ef      rv_ef     v_ef
     11        183.9     116.60    29.2
     12         49.7     134.82    33.7
     13        -57.6     105.81    26.5
     14         63.2      89.56    22.4
     15         47.3     142.11    35.5
     21       -178.2     112.07    28.0
     22          8.1     134.68    33.7
     23         -1.7     140.72    35.2
     24        -54.6     115.35    28.8
     25       -224.1     123.42    30.9
     31          3.2      73.85    18.5
     32         54.6     112.75    28.2
     33         32.5     107.52    26.9
     34        148.2     130.83    32.7
     35         79.5     106.52    26.6
     41       -145.9     137.38    34.3
     42       -145.6     111.39    27.8
     43        -28.1     132.59    33.1
     44        106.7     114.49    28.6
     45        162.4     124.29    31.1
     51        120.4     138.13    34.5
     52        -79.9     117.36    29.3
     53        -57.7     121.30    30.3
     54       -116.1     118.67    29.7
     55         29.8     127.99    32.0

   Total         0      2990.2

This sum of squares is all that is required in order to set up the more
complete analysis given in Table 13-8.

TABLE 13-8

COMPLETE ANALYSIS OF QUADRUPLE LATTICE WITH 25 VARIETIES

                                     SS          DF      MS
   Replicates                        3384.235     3
   Blocks (eliminating varieties)     942.689    16      58.9181 = E_b
   Varieties (ignoring blocks)       1664.745    24
   Intra-block error                  985.271    56      17.5941 = E_e
   Total                             6976.940    99


Since $E_b$ is considerably larger than $E_e$, the analysis indicates that the
variety means should be corrected. We find

$$w = \frac{1}{E_e} = \frac{1}{17.5941} = 0.056\,837\,24$$

$$w' = \frac{r-1}{rE_b - E_e} = \frac{3}{4 \times 58.9181 - 17.5941} = \frac{3}{218.0783} = 0.013\,756\,53$$

$$\lambda = \frac{(k-1)(w-w')}{(k-1)w + w'} = \frac{3(0.043\,080\,71)}{0.184\,268\,25} = 0.701\,380$$

For the partially balanced lattice that is not repeated, a short cut is
available here which arises from the fact that the formula for $\lambda$ can be
expressed

$$\lambda = \frac{E_b - E_e}{E_b} = \frac{58.9181 - 17.5941}{58.9181} = 0.701\,380 \tag{44}$$

Next we require

$$\frac{\lambda}{(k-1)p} = \frac{0.701\,380}{15} = 0.046\,759$$

The corrected variety totals are then calculated from (34). Thus

$$rv_{23} = 140.8 + 0.046\,759 \times (-1.7) = 140.72$$

These totals are calculated and set up as in Table 13-7, and finally the
variety means are set up in the same table. The sum of the corrected
totals if carried to 2 decimals should check closely with the grand total
of the experiment.

It remains to calculate $V_m$ and the corresponding standard error.
From (37) we have

$$V_m = \frac{2 \times 17.5941}{4}\left(1 + \frac{4 \times 0.701\,380}{3 \times 6}\right) = 8.797\,05(1.155\,862) = 10.1682$$

and

$$SE_m = \sqrt{10.1682} = 3.19$$

In order to make an F test that is more accurate than in the randomized
block analysis of the significance of the variety differences, we apply
formula (38), for which we require first the value of $S_u$, the unadjusted
sum of squares for blocks. Calculating this from the block totals of
Table 13-4, we have

$$S_u = \frac{470{,}421.08}{5} - \frac{2{,}319{,}929.88}{25} = 1287.021$$

where the correction term is the uncorrected sum of squares for replicates.
We then have

$$-\lambda\left[\left(1 + \frac{w'}{(k-1)w}\right)S_u - S_a\right] = -0.701\,380\left[\left(1 + \frac{0.013\,756\,53}{3 \times 0.056\,837\,24}\right)1287.021 - 942.689\right] = -314.335$$

The corrected sum of squares is 1664.745 - 314.335 = 1350.410, and
the corrected variance is 1350.410/24 = 56.27. The F value is then
56.27/17.59 = 3.20. As would be expected, this is larger than the F
value of Table 13-6.
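The arithmetic of this example, from the two mean squares to the final F value, can be retraced in a few lines. The sketch below takes E_b, E_e, the block totals, and the sums of squares from Tables 13-4, 13-6, and 13-8, and assumes formulas (31), (33), (37), and (38) in the forms used in the worked figures above. The variable names are illustrative.

```python
from math import sqrt

E_b, E_e = 58.9181, 17.5941        # mean squares from Table 13-8
r, k, p = 4, 4, 5

w = 1.0 / E_e
w_prime = (r - 1) / (r * E_b - E_e)                      # formula (31)
lam = (k - 1) * (w - w_prime) / ((k - 1) * w + w_prime)  # formula (33)
lam_short = (E_b - E_e) / E_b                            # short cut (44), valid when r = k

V_m = (2 * E_e / r) * (1 + k * lam / ((k - 1) * (p + 1)))  # formula (37)
SE_m = sqrt(V_m)

# Unadjusted block sum of squares S_u from the block totals of Table 13-4.
block_totals = [
    [150.5, 205.3, 134.2, 179.1, 183.2],   # replicate 1
    [178.2, 192.4, 191.6, 165.8, 182.9],   # replicate 2
    [86.4, 126.5, 110.5, 119.9, 96.0],     # replicate 3
    [117.8, 147.6, 130.5, 168.4, 123.4],   # replicate 4
]
S_u = (sum(b * b for rep in block_totals for b in rep) / p
       - sum(sum(rep) ** 2 for rep in block_totals) / p ** 2)
S_a = 942.689                               # adjusted block sum of squares

adjustment = lam * ((1 + w_prime / ((k - 1) * w)) * S_u - S_a)   # formula (38)
ss_varieties = 1664.745 - adjustment
F = (ss_varieties / (p ** 2 - 1)) / E_e
print(round(lam, 6), round(V_m, 4), round(S_u, 3), round(adjustment, 3), round(F, 2))
```

Computing λ by both routes is a cheap safeguard: the general formula and the short cut must agree whenever the groups are not repeated.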

12. Example 13-2. Partially balanced lattice with groups repeated. In
this experiment 16 varieties are arranged in a simple lattice with 4 replicates.
Table 13-9 gives the plot yields and the basic quantities d, b, u,
and u - kb that must be calculated first. Table 13-10 gives the variety
totals in the form of a square. Note that the row and column totals are
the values of u for Table 13-9.

The preliminary analysis, assuming the experiment to be laid out in
randomized blocks, is performed first, but it is not necessary to reproduce
the calculations here.

TABLE 13-9

PLOT YIELDS, BLOCK TOTALS, AND VALUES OF u AND u - kb FOR A SIMPLE LATTICE IN 4 REPLICATIONS

   Block  Variety  Replicate 1  Replicate 3  Group A           Variety  Replicate 2  Replicate 4  Group B
     1      11        93.6        105.0       198.6              11        76.0         85.0       161.0
            12        99.5         97.0       196.5              21       113.5        115.6       229.1
            13       113.3        115.2       228.5              31        84.4        101.5       185.9
            14       100.8         95.8       196.6              41        87.5         98.6       186.1
                     407.2        413.0       820.2  b                    361.4        400.7       762.1  b
                     d = -5.8                1558.8  u                    d = -39.3               1529.2  u
                                              -81.6  u - kb                                          5.0  u - kb

     2      21       109.0        102.7       211.7              12        98.1         78.2       176.3
            22        90.6         95.3       185.9              22        78.2         83.1       161.3
            23        79.4        106.6       186.0              32        93.0         93.6       186.6
            24        73.0         72.3       145.3              42        48.8         73.7       122.5
                     352.0        376.9       728.9  b                    318.1        328.6       646.7  b
                     d = -24.9               1471.8  u                    d = -10.5               1359.3  u
                                               14.0  u - kb                                         65.9  u - kb

     3      31        76.1        105.0       181.1              13       113.1        102.5       215.6
            32        61.6        101.5       163.1              23        75.9         91.7       167.6
            33        80.6         93.0       173.6              33       104.5         78.2       182.7
            34        88.5        103.3       191.8              43       129.7        110.9       240.6
                     306.8        402.8       709.6  b                    423.2        383.3       806.5  b
                     d = -96.0               1495.6  u                    d = 39.9                1606.1  u
                                               76.4  u - kb                                         -6.9  u - kb

     4      41        89.6         86.1       175.7              14        71.6        114.1       185.7
            42        76.4         90.7       167.1              24        92.5         92.4       184.9
            43       100.6        110.9       211.5              34       117.8        113.0       230.8
            44        80.2         91.8       172.0              44        92.6         96.4       189.0
                     346.8        379.5       726.3  b                    374.5        415.9       790.4  b
                     d = -32.7               1464.5  u                    d = -41.4               1496.1  u
                                               11.9  u - kb                                        -84.7  u - kb

TABLE 13-10

VARIETY TOTALS FOR SIMPLE LATTICE IN 4 REPLICATIONS

                                                        Row Total
                     11        12        13        14
                    359.6     372.8     444.1     382.3     1558.8
                     21        22        23        24
                    440.8     347.2     353.6     330.2     1471.8
                     31        32        33        34
                    367.0     349.7     356.3     422.6     1495.6
                     41        42        43        44
                    361.8     289.6     452.1     361.0     1464.5
   Column Total    1529.2    1359.3    1606.1    1496.1     5990.7

TABLE 13-11

VALUES BY VARIETIES OF W_ef, rv_ef, AND v_ef. SIMPLE LATTICE
IN 4 REPLICATIONS

   Variety     W_ef      rv_ef      v_ef
     11        -76.6     351.88     88.0
     12        -15.7     371.22     92.8
     13        -88.5     435.18    108.8
     14       -166.3     365.54     91.4
     21         19.0     442.71    110.7
     22         79.9     355.25     88.8
     23          7.1     354.32     88.6
     24        -70.7     323.07     80.8
     31         81.4     375.20     93.8
     32        142.3     364.04     91.0
     33         69.5     363.30     90.8
     34         -8.3     421.76    105.4
     41         16.9     363.50     90.9
     42         77.8     297.44     74.4
     43          5.0     452.60    113.2
     44        -72.8     353.66     88.4

   Total         0      5990.7

With the preliminary analysis made and the data set up as in Tables
13-9 and 13-10, the following steps are required to perform the complete
analysis.


1. Calculate values of $W_{ef}$ from formula (29). For example

$$W_{32} = 76.4 + 65.9 = 142.3$$

2. Calculate sum of squares for blocks. This is obtained in two parts.

$$\text{Component (a)} = \frac{\sum(d^2)}{2p} - \frac{\sum(D^2)}{2p^2} \tag{45}$$

$$= \frac{15{,}899.65}{8} - \frac{28{,}040.05}{32} = 1111.204$$

where D is the total of the values of d for a group.

$$\text{Component (b), formula (30)} = \frac{96{,}833.58}{64} = 1513.025$$

Combining components (a) and (b) gives 1111.204 + 1513.025 =
2624.229.

3. The total sum of squares and the sums of squares for varieties and
replicates are obtained in the usual manner, and the complete analysis
can then be set up as in Table 13-12.

TABLE 13-12

COMPLETE ANALYSIS OF VARIANCE FOR PARTIALLY BALANCED
LATTICE WITH GROUPS REPEATED

                                     SS           DF      MS
   Replicates                          882.947     3
   Blocks (eliminating varieties)    2,624.229    12      218.686 = E_b
   Varieties (ignoring blocks)       7,389.881    15
   Intra-block error                 3,584.612    33      108.625 = E_e
   Total                            14,481.669    63

4. Compute $\lambda$ and $\lambda/(k-1)p$. We require $w$ and $w'$, where

$$w = \frac{1}{E_e} = 0.009\,205\,984$$

and, according to formula (31),

$$w' = \frac{3}{4E_b - E_e} = 0.003\,915\,841$$

Then $\lambda$ from (33) is

$$\lambda = \frac{0.005\,290\,143}{0.013\,121\,825} = 0.403\,156$$

There is also a short cut here in that $\lambda$ can be computed directly from

$$\lambda = \frac{2(k-1)(E_b - E_e)}{2(k-1)E_b + E_e} = \frac{2(E_b - E_e)}{2E_b + E_e} \quad \text{when } k = 2 \tag{46}$$

It is assumed in this formula that there are not more than 2 repetitions
of each group. For this example

$$\lambda = \frac{2(218.686 - 108.625)}{2 \times 218.686 + 108.625} = 0.403\,156$$

Finally

$$\frac{\lambda}{(k-1)p} = \frac{0.403\,156}{4} = 0.100\,789$$

5. Calculate corrected variety totals from (34). For example,

$$rv_{42} = 289.6 + 0.100\,789 \times 77.8 = 297.44$$

These, together with the variety means, are given in Table 13-11.

6. Determine the standard error for comparing variety means. From
(37)

$$V_m = \frac{2 \times 108.625}{4}\left[1 + 2\left(\frac{0.403\,156}{5}\right)\right] = 54.3125(1 + 0.161\,262) = 63.071\,04$$

and

$$SE_m = \sqrt{63.071\,04} = 7.94$$

7. Make an F test of the variety variance if this seems to be necessary.
From (38) the correction to the variety variance is

$$-0.403\,156\left[\left(1 + \frac{0.003\,915\,841}{0.009\,205\,984}\right)S_u - 1513.025\right]$$

where $S_u$ is the unadjusted sum of squares for component (b), and $S_a$ is
the adjusted sum of squares for component (b). $S_u$ is calculated as follows.

$$S_u = \frac{820.2^2 + 728.9^2 + \cdots + 790.4^2}{8} - \frac{2985.0^2 + 3005.7^2}{32} = 2893.055$$

Substituting in the expression above, we have

$$-0.403\,156[(1.425\,358 \times 2893.055) - 1513.025] = -1052.485$$

Then the corrected sum of squares is 7389.881 - 1052.485 = 6337.397,
and the corrected variance is 6337.397/15 = 422.493. Finally the F
value is 422.493/108.625 = 3.89.
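The main quantities of this example can be recomputed from the d values and block totals of Table 13-9 and the mean squares of Table 13-12. The sketch below is illustrative: it assumes that the correction term in component (a) uses the group totals of d, and that formulas (31), (33), (37), and (38) take the forms used in the worked figures. Variable names are hypothetical.

```python
p, r, k = 4, 4, 2

# Differences d between the two replicates of each block (Table 13-9).
d_group_A = [-5.8, -24.9, -96.0, -32.7]
d_group_B = [-39.3, -10.5, 39.9, -41.4]
all_d = d_group_A + d_group_B
component_a = (sum(d * d for d in all_d) / (2 * p)
               - (sum(d_group_A) ** 2 + sum(d_group_B) ** 2) / (2 * p ** 2))

E_b, E_e = 218.686, 108.625          # mean squares from Table 13-12
w = 1.0 / E_e
w_prime = (r - 1) / (r * E_b - E_e)                      # formula (31)
lam = (k - 1) * (w - w_prime) / ((k - 1) * w + w_prime)  # formula (33)
lam_short = 2 * (E_b - E_e) / (2 * E_b + E_e)            # short cut (46), k = 2

V_m = (2 * E_e / r) * (1 + k * lam / ((k - 1) * (p + 1)))  # formula (37)

# Unadjusted component (b): block totals over both replicates of each group.
blocks_A = [820.2, 728.9, 709.6, 726.3]
blocks_B = [762.1, 646.7, 806.5, 790.4]
S_u = (sum(b * b for b in blocks_A + blocks_B) / (2 * p)
       - (sum(blocks_A) ** 2 + sum(blocks_B) ** 2) / (2 * p ** 2))
S_a = 1513.025                        # adjusted component (b)

adjustment = lam * ((1 + w_prime / ((k - 1) * w)) * S_u - S_a)   # formula (38)
F = ((7389.881 - adjustment) / (p ** 2 - 1)) / E_e
print(round(component_a, 3), round(lam, 6), round(V_m, 3), round(S_u, 3), round(F, 2))
```

As in the previous example, the agreement of the general formula for λ with the short cut is a useful check on the arithmetic.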

13. Example 13-3. Balanced lattice, with 25 varieties in 6 replications.
Table 13-13 gives the plot yields and block totals with replicate totals.

TABLE 13-13

PLOT YIELDS FROM BALANCED LATTICE EXPERIMENT WITH 25 VARIETIES
IN 6 REPLICATES

   Replicate 1      Replicate 2      Replicate 3      Replicate 4      Replicate 5      Replicate 6
   Variety Group    Variety Group    Variety Group    Variety Group    Variety Group    Variety Group
     11    34.5       11    39.1       11    42.3       11    42.2       11    37.2       11    40.0
     12    48.6       21    41.5       22    41.1       32    44.8       42    33.9       52    46.3
     13    45.9       31    35.9       33    36.0       53    34.7       23    32.0       43    40.2
     14    45.1       41    44.6       44    44.2       24    43.7       54    42.7       34    46.0
     15    44.6       51    43.2       55    44.1       45    40.0       35    40.6       25    36.6
     b    218.7            204.3            207.7            205.4            186.4            209.1
     21    51.1       12    43.2       21    34.6       21    43.7       21    50.8       21    48.8
     b    205.7            233.2            198.6            183.7            229.1            235.8
     51    44.6       15    44.8       51    46.9       51    40.1       51    31.6       51    42.4
     52    48.8       25    37.0       12    50.8       22    48.8       32    35.5       42    38.2
     53    48.8       35    44.6       23    34.4       43    40.0       13    39.5       33    34.9
     54    41.2       45    44.4       34    46.8       14    48.3       44    35.7       24    38.8
     55    41.1       55    44.0       45    46.3       35    44.2       25    29.1       15    34.8
     b    224.5            214.8            225.2            221.4            171.4            189.1
   Replicate      1021.8   983.9    996.9
   Total          1086.7   1074.8   1081.1

After the preliminary analysis on the basis of the experiment's having
been laid out in randomized blocks, the following are the steps in the
calculations.


1. Calculate $W_{ef}$ for each variety, using formula (39). For example,

$$W_{11} = 5 \times 235.3 + 6245.2 - 6(218.7 + 204.3 + 207.7 + 205.4 + 186.4 + 209.1) = 32.1$$

2. Obtain from formula (30) the sum of squares for blocks from which
the variety effect has been eliminated. This is

$$\frac{921{,}964.14}{750} = 1229.286$$

The values of $W_{ef}$ in this example are given in Table 13-14.

TABLE 13-14

VARIETY TOTALS, VALUES OF W_ef, rv_ef, AND v_ef FOR BALANCED
LATTICE EXPERIMENT

   Variety     t_ef      W_ef      rv_ef     v_ef
     11        235.3      32.1     236.16    39.4
     12        281.3    -297.7     273.27    45.5
     13        267.9     283.3     275.54    45.9
     14        269.3      25.1     269.98    45.0
     15        248.6     183.2     253.54    42.3
     21        270.5      54.5     271.97    45.3
     22        277.3     123.9     280.64    46.8
     23        191.2     -18.6     190.70    31.8
     24        257.3     -24.1     256.65    42.8
     25        203.6     -32.2     202.73    33.8
     31        192.3     205.3     197.84    33.0
     32        250.1     368.9     260.05    43.3
     33        234.0     -88.4     231.62    38.6
     34        265.1     190.7     270.24    45.0
     35        265.4    -274.0     258.01    43.0
     41        261.1    -251.1     254.33    42.4
     42        227.4      -0.8     227.38    37.9
     43        255.3    -124.7     251.94    42.0
     44        254.6     -52.6     253.18    42.2
     45        265.2    -204.8     259.68    43.3
     51        248.8      73.8     250.79    41.8
     52        279.7    -407.7     268.70    44.8
     53        250.1    -158.5     245.82    41.0
     54        248.0     164.0     252.42    42.1
     55        245.8     230.4     252.01    42.0

   Total      6245.2       0      6245.2

3. Set up complete analysis and calculate $E_b$ and $E_e$. This is given in
Table 13-15.
TABLE 13-15

COMPLETE ANALYSIS FOR BALANCED LATTICE WITH 25 VARIETIES
IN 6 REPLICATES

                                     SS          DF      MS
   Replicates                         416.508     5
   Blocks (eliminating varieties)    1229.286    24      51.2202 = E_b
   Varieties (ignoring blocks)       2431.477    24
   Error                             1601.469    96      16.6820 = E_e
   Total                             5678.740   149

4. Calculate $\lambda$ and obtain corrected variety totals. The formulas are
(44) and (34).

$$\lambda = \frac{51.2202 - 16.6820}{51.2202} = 0.674\,308$$

For formula (34) we require

$$\frac{\lambda}{(k-1)p} = \frac{0.674\,308}{25} = 0.026\,972\,3$$

Then a corrected variety total such as that for variety 24 is

$$rv_{24} = 257.3 - 0.026\,972 \times 24.1 = 256.65$$

The corrected totals for this example will be found in Table 13-14.

5. The variance of a difference between variety means from (35) is

$$\frac{2 \times 16.682}{6}\left(1 + \frac{0.674\,308}{5}\right) = 6.3106$$

and the SE of a mean difference is $\sqrt{6.3106} = 2.51$.

6. Obtain the corrected variety sum of squares from (43). Here we
have

$$\frac{(1{,}573{,}757.703/6) - (6245.2^2/150)}{1.134\,86} = 2005.65$$

Then the mean square is 2005.65/24 = 83.569, and F = 83.569/16.682
= 5.01.
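The steps of this example can be retraced from the totals quoted above. The sketch below uses only figures that appear in the text and in Tables 13-14 and 13-15 (the variety total and block totals for variety 11, the sum of squares of W, the two mean squares, and the sum of squares of the corrected totals). The variable names are illustrative.

```python
from math import sqrt

p, r = 5, 6
k = r                                  # balanced lattice: k = r = p + 1 groups
G = 6245.2                             # grand total

# Step 1: W for variety 11 from formula (39).
t_11 = 235.3
blocks_with_11 = [218.7, 204.3, 207.7, 205.4, 186.4, 209.1]
W_11 = p * t_11 + G - r * sum(blocks_with_11)

# Step 2: block sum of squares from formula (30).
sum_W_squared = 921964.14
ss_blocks = sum_W_squared / (r * (k - 1) * p ** 2)

# Step 4: lambda from the short cut (44) and the multiplier for formula (34).
E_b, E_e = 51.2202, 16.6820
lam = (E_b - E_e) / E_b
multiplier = lam / ((k - 1) * p)
rv_24 = 257.3 + multiplier * (-24.1)

# Step 5: variance and standard error of a difference between two means.
variance = (2 * E_e / r) * (1 + lam / p)
se_diff = sqrt(variance)

# Step 6: corrected variety sum of squares, formula (43), and the F value.
sum_rv_squared = 1573757.703
ss_varieties = (sum_rv_squared / r - G ** 2 / (r * p ** 2)) / (1 + lam / p)
F = (ss_varieties / (p ** 2 - 1)) / E_e
print(round(W_11, 1), round(ss_blocks, 3), round(lam, 6), round(variance, 4), round(F, 2))
```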
14. Lattice squares. The lattice square design, originally known as a
quasi-Latin square, was first described by Yates [18] in 1937. In the
basic design each replicate containing $p^2$ varieties is laid out so that there
are p plots along each side, or in other words there are p plots in each
row and column. These rows and columns are then the error control
units similar to the incomplete blocks in square lattice designs. In this
way variations in soil fertility in two directions across the field can be
eliminated from the experimental error.

The lattice squares described by Yates may be classified as balanced
types in that any one variety appears once either in a row or a column
with every other variety. Actually the balanced types may be divided
into class (1) in which $r = \tfrac{1}{2}(p + 1)$ and each variety occurs in either a
row or a column with every other variety, and class (2) in which
$r = (p + 1)$ and each variety occurs in both a row and a column with
every other variety.

In 1943, Cochran [3] described lattice square designs that may be
designated as partially balanced in that each variety does not occur either
in a row or a column with every other variety. This extended the application
of lattice square designs in that the experimenter was given a greater
choice with respect to number of replications. The types described by
Yates and Cochran are possible only if the completely orthogonalized
square exists. Kempthorne and Federer [15] in 1949 and Federer [7] in
1950 described what they term unbalanced lattice squares which are
possible for any value of p where the Latin square exists, with replications
in multiples of 3. Complete instructions for the analysis of these squares
are given in the paper by Federer [7]. The possible lattice squares are
summarized in Table 13-16.


TABLE 13-16

USEFUL RANGE OF LATTICE SQUARE DESIGNS FOR DIFFERENT VALUES
OF p AND p², ACCORDING TO NUMBER OF REPLICATIONS

   p =                   4     5     6     7     8      9     10     11       12     13
   v = p² =             16    25    36    49    64     81    100    121      144    169
   Partially balanced                      3    3,4    3,4          3,4,5           3,4,5,6
   Balanced
     r = ½(p + 1)              3           4            5            6               7
     r = (p + 1)         5     6           8     9     10           12              14
   Unbalanced           3,6   3,6   3,6   3,6   3,6    3,6   3,6    3,6      3,6    3,6


15. Lattice squares: Partially balanced, and balanced of type
$r = \tfrac{1}{2}(p + 1)$. Since there are only slight differences in method in the
design and analysis of these two types, they can be considered together.
For the type $r = \tfrac{1}{2}(p + 1)$ the design of the squares is based on the
method described in Section 6 for square lattices. We shall illustrate the
method here for the 5 × 5 square, but for general application squares
already made up can be obtained from the tables by Fisher and Yates [10]
and from Cochran and Cox [4]. We first write out the 6 orthogonal
squares for the 5 × 5 square as follows.

          I                    II                   III
   11 12 13 14 15       11 21 31 41 51       11 22 33 44 55
   21 22 23 24 25       12 22 32 42 52       21 32 43 54 15
   31 32 33 34 35       13 23 33 43 53       31 42 53 14 25
   41 42 43 44 45       14 24 34 44 54       41 52 13 24 35
   51 52 53 54 55       15 25 35 45 55       51 12 23 34 45

          IV                   V                    VI
   11 32 53 24 45       11 42 23 54 35       11 52 43 34 25
   21 42 13 34 55       21 52 33 14 45       21 12 53 44 35
   31 52 23 44 15       31 12 43 24 55       31 22 13 54 45
   41 12 33 54 25       41 22 53 34 15       41 32 23 14 55
   51 22 43 14 35       51 32 13 44 25       51 42 33 24 15

Then from any 2 of the above squares a new square can be formed such
that the rows will come from the rows of the first square and the columns
from the rows of the second square. Taking squares I and II, III and IV,
and V and VI in pairs we get the following.

       I and II            III and IV            V and VI
   11 12 13 14 15       11 22 33 44 55       11 42 23 54 35
   21 22 23 24 25       32 43 54 15 21       52 33 14 45 21
   31 32 33 34 35       53 14 25 31 42       43 24 55 31 12
   41 42 43 44 45       24 35 41 52 13       34 15 41 22 53
   51 52 53 54 55       45 51 12 23 34       25 51 32 13 44

Note that in these 3 squares any one variety occurs either in a row or
a column with every other variety.
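The construction just described, and the balance property noted for the 3 combined squares, can be checked mechanically. The sketch below is an illustration, not the author's procedure: it generates the 6 orthogonal squares for p = 5 by a cyclic rule that reproduces the squares printed above (this rule assumes p is prime), combines them in pairs, and counts how often each pair of varieties shares a row or a column. Columns of the combined squares may appear in a different order from the printed ones, which does not affect the property. Varieties are held as zero-based (e, f) pairs.

```python
from itertools import combinations

P = 5  # side of the square; the cyclic rule below assumes P is prime

def cyclic_square(m):
    """Rows of an orthogonal square: row i holds varieties ((i + m*f) mod P, f)."""
    return [[((i + m * f) % P, f) for f in range(P)] for i in range(P)]

def transposed_square():
    """Square II: its rows are the columns of square I."""
    return [[(e, f) for e in range(P)] for f in range(P)]

def combine(rows_from, cols_from):
    """Square whose rows are rows of one square and columns are rows of another."""
    square = []
    for row in rows_from:
        line = []
        for col in cols_from:
            (cell,) = set(row) & set(col)   # orthogonality: exactly one common variety
            line.append(cell)
        square.append(line)
    return square

def pairs_sharing_row_or_column(square):
    """All unordered pairs of varieties that share a row or a column."""
    pairs = []
    for line in square:
        pairs += combinations(sorted(line), 2)
    for j in range(P):
        column = sorted(square[i][j] for i in range(P))
        pairs += combinations(column, 2)
    return pairs

I, II = cyclic_square(0), transposed_square()
III, IV, V, VI = (cyclic_square(m) for m in (1, 2, 3, 4))
lattice_squares = [combine(I, II), combine(III, IV), combine(V, VI)]

all_pairs = [pair for sq in lattice_squares for pair in pairs_sharing_row_or_column(sq)]
balanced = len(all_pairs) == len(set(all_pairs)) == P * P * (P * P - 1) // 2
print(balanced)
```

With all 3 squares every one of the 300 pairs of varieties is met exactly once. Dropping one square leaves 200 pairs met once and 100 never met, which is the partially balanced case.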

Assuming that all 3 squares are included in the design, we have a 
balanced lattice square with 3 = 1/4(5 + 1) replications. If only 2 of the 
squares are included, the design will be of the partially balanced type. 

After the squares have been selected, the rows and columns of each are 
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arranged at random and the actual varieties assigned at random to the 


numbers. 

The analysis of the results follows a logical procedure as outlined 
below, after the data are set up in the form of squares as shown diagram- 
matically in Table 13-17. The yields are arranged to correspond with 


TABLE 13-17 


DIAGRAMMATIC REPRESENTATION OF RESULTS FROM A LATTICE 
SQUARE EXPERIMENT 


n 1 2 3 

ЫТЫ ON RS L 
| | | 
| | | 

с T с T, c Т, 

S. Se S. 

M РІ М Р, M P, 

B B B 


Variety Totals 


G 


variety numbers as in the original squares before randomization. The 
row totals are represented by R, and the column totals by С. For each 
row S, is the sum of the totals of the varieties in that row, and for each 
column S, is the sum of the totals of the varieties in that column. The 
values of L and M are calculated as follows. 


L=S,—rR M=S,—rC 47 


P, is the sum of the values of L and М in the corresponding replicate. 
The various steps in the analysis are enumerated and described below. 
1. Calculate L for each row and M for each column, and total to obtain 
the value of P. Also, for each replicate, if G is the grand total of the 
experiment, Р, = G—rT,, where Т, is the replicate total, furnishing a 
check on the calculations. 
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2. Set up the outline of the analysis of variance as given below and 
calculate the sums of squares. 


                                              DF                         MS
Replicates                                    r - 1
Varieties (ignoring rows and columns)         p^2 - 1
Rows (eliminating varieties)                  r(p - 1)                   E_r
Columns (eliminating varieties)               r(p - 1)                   E_c
Error                                         (p - 1)(rp - p - r - 1)    E_e
Total                                         rp^2 - 1


The variety sum of squares, since it ignores the effect of rows and columns,
is calculated directly from the variety totals. The sums of squares for
rows and columns are calculated from the values of $L$, $M$, and $P$. For
rows we have

\[ \frac{\Sigma(L^2)}{rp(r - 1)} - \frac{\Sigma(P^2)}{rp^2(r - 1)} \tag{48} \]

and for columns

\[ \frac{\Sigma(M^2)}{rp(r - 1)} - \frac{\Sigma(P^2)}{rp^2(r - 1)} \tag{49} \]

After calculating the total sum of squares and that for replicates, the error
is obtained by subtraction. The mean squares for rows ($E_r$), columns
($E_c$), and error ($E_e$) are obtained as indicated.

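3. Calculate the factors $\lambda$ and $\mu$ from

\[ \lambda = \frac{E_r - E_e}{(r - 1)E_r}, \qquad \mu = \frac{E_c - E_e}{(r - 1)E_c} \tag{50} \]

noting that, if $E_r < E_e$, $\lambda$ is taken as zero, and, if $E_c < E_e$, $\mu$ is taken as
zero. From these factors we obtain $\lambda/p$ and $\mu/p$, which are used in
converting the values of $L$ into $\alpha$ and values of $M$ into $\beta$ by multiplication.
Thus

\[ \alpha = L\left(\frac{\lambda}{p}\right), \qquad \beta = M\left(\frac{\mu}{p}\right) \tag{51} \]

Corrections to variety totals are made by adding the quantities $\alpha$ and
$\beta$ for each row and column in which the variety appears.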
4. Calculate the mean standard error for comparing variety means from

\[ \sqrt{\frac{2E_e}{r}\left[1 + \frac{r(\lambda + \mu)}{p + 1}\right]} \tag{52} \]

Note that when $r = \tfrac{1}{2}(p + 1)$ this formula reduces to

\[ \sqrt{\frac{2E_e}{r}\left[1 + \tfrac{1}{2}(\lambda + \mu)\right]} \tag{53} \]


The essential difference between partially balanced and balanced lattice
squares is that in the former there are certain pairs of varieties that do
not appear together in the same row or column. Further, in balanced
lattice squares where $r = \tfrac{1}{2}(p + 1)$ there may be differences in the
comparison of varieties, depending on whether they occur together in a
column or in a row. These facts indicate that for the partially balanced
lattice square there are three kinds of comparisons. Cochran [3] gives
the formulas for the average experimental error per plot for these
comparisons, which are converted here to standard errors of differences
between means. These are:


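Two varieties occurring in the same row:

\[ \sqrt{\frac{2E_e}{r}\left[1 + \frac{(r - 1)\lambda + r\mu}{p}\right]} \]

Two varieties occurring in the same column:

\[ \sqrt{\frac{2E_e}{r}\left[1 + \frac{r\lambda + (r - 1)\mu}{p}\right]} \]

Two varieties not occurring together in the same row or column:

\[ \sqrt{\frac{2E_e}{r}\left[1 + \frac{r(\lambda + \mu)}{p}\right]} \]
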
For the balanced lattice square the third formula is not applicable. For
both the balanced type with $r = \tfrac{1}{2}(p + 1)$ and the partially balanced type
formula (52) given above is general.

16. Example 13-4. The analysis of a 5 x 5 lattice square experiment 
in 3 replications. 

Table 13-18 gives the plot yields in a lattice square experiment described 
by Yates [21] together with the values of $L$, $M$, $\alpha$, and $\beta$. The experiment
concerned the effect of 25 manurial combinations on sugar beets, and 
since the interactions were of particular interest it was decided that a 
lattice square experiment giving equal accuracy to all comparisons was 
preferable to a factorial experiment in which some of the interactions 
were confounded. Table 13-19 gives the required treatment totals 
collected from Table 13-18. 


TABLE 13-18

YIELDS OF A 5 x 5 LATTICE SQUARE EXPERIMENT WITH SUGAR BEETS

                                                       Row Total         L       α
Square 1
                    sy      ny       n      sz       s
                  61.2    67.7    68.1    65.2    59.9     322.1      17.1     1.4
                    mz      nx      mx      nw      nz
                  69.4    67.4    64.9    77.6    67.7     347.0      25.6     2.1
                     w       x       c      cy       m
                  46.3    59.7    76.2    80.3    71.2     333.7     -93.7    -7.5
                     o      cx      my      cw      mw
                  55.2    77.0    82.4    80.4    77.7     372.7    -120.8    -9.7
                    sw      sx       z       y      cz
                  73.9    79.0    51.6    44.8    78.4     327.7     -79.9    -6.4
Column Total     306.0   350.8   343.2   348.3   354.9    1703.2               -20.1
M                -39.2   -53.1   -40.3   -82.3   -36.8    -251.7
β                 -1.4    -1.9    -1.4    -2.9    -1.3      -8.9

Square 2
                    mz       y       c       s      cx
                  69.1    49.6    72.6    68.7    74.0     334.0     -38.1    -3.1
                    cw       n      nx       w      cz
                  62.7    74.0    78.1    40.0    66.3     321.1       3.8     0.3
                    ny      nz      sw      my      cy
                  68.9    78.1    69.6    68.1    68.5     353.2      20.1     1.6
                     m       o      sz      sx      mx
                  64.7    38.6    57.4    65.5    67.1     293.3      60.5     4.8
                     z       x      mw      nw      sy
                  48.0    39.0    64.9    72.2    58.7     282.8      58.4     4.7
Column Total     313.4   279.3   342.6   314.5   334.6    1584.4                 8.3
M                 23.9    11.0    10.2    22.7    36.9     104.7
β                  0.9     0.4     0.4     0.8     1.3       3.8

Square 3
                    sy      my      nx       y       m
                  57.6    69.2    64.0    31.8    58.4     281.0      84.2     6.7
                    nz      sz       z       w      cx
                  72.2    73.3    46.2    45.9    73.4     311.0     -16.7    -1.3
                    sx      mz      cy      mw       n
                  61.4    77.9    73.1    70.4    70.5     353.3       9.9     0.8
                     c      cz       o      ny      nw
                  58.6    68.4    46.7    71.3    69.1     314.1      45.5     3.6
                    cw       x       s      mx      sw
                  56.6    52.9    60.9    71.8    68.7     310.9      24.1     1.9
Column Total     306.4   341.7   290.9   291.2   340.1    1570.3                11.7
M                 89.3   -28.4    34.5     9.5    42.1     147.0
β                  3.2    -1.0     1.2     0.3     1.5       5.2

After the row and column totals are obtained, the following are the
steps in the calculations.

1. Calculations of $L$ and $M$. See Table 13-17 and formula (47). For
example, the second value of $L$ in the first square is

216.4 + 209.5 + 203.8 + 218.9 + 218.0 - 3 x 347.0 = 25.6

The first value of $M$ in the first square is

177.5 + 216.4 + 132.2 + 140.5 + 212.2 - 3 x 306.0 = -39.2
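
The arithmetic of step 1 can be checked with a short script. The sketch below is not part of the original text: it assumes the layout and yields as read from Table 13-18 (a few illegible entries were restored from the printed marginal totals), and the variable names are illustrative only. It forms the treatment totals, then $L$ and $M$ by formula (47), and confirms the check $P = G - rT$.

```python
# Treatment labels and yields by row for the three squares (as read from Table 13-18).
labels = [
    ["sy ny n sz s", "mz nx mx nw nz", "w x c cy m", "o cx my cw mw", "sw sx z y cz"],
    ["mz y c s cx", "cw n nx w cz", "ny nz sw my cy", "m o sz sx mx", "z x mw nw sy"],
    ["sy my nx y m", "nz sz z w cx", "sx mz cy mw n", "c cz o ny nw", "cw x s mx sw"],
]
yields = [
    [[61.2, 67.7, 68.1, 65.2, 59.9], [69.4, 67.4, 64.9, 77.6, 67.7],
     [46.3, 59.7, 76.2, 80.3, 71.2], [55.2, 77.0, 82.4, 80.4, 77.7],
     [73.9, 79.0, 51.6, 44.8, 78.4]],
    [[69.1, 49.6, 72.6, 68.7, 74.0], [62.7, 74.0, 78.1, 40.0, 66.3],
     [68.9, 78.1, 69.6, 68.1, 68.5], [64.7, 38.6, 57.4, 65.5, 67.1],
     [48.0, 39.0, 64.9, 72.2, 58.7]],
    [[57.6, 69.2, 64.0, 31.8, 58.4], [72.2, 73.3, 46.2, 45.9, 73.4],
     [61.4, 77.9, 73.1, 70.4, 70.5], [58.6, 68.4, 46.7, 71.3, 69.1],
     [56.6, 52.9, 60.9, 71.8, 68.7]],
]
r = 3  # number of replicates (squares)

# Pair every label with its yield: squares[t][i][j] = (label, yield).
squares = [[list(zip(lab.split(), ys)) for lab, ys in zip(sq_lab, sq_y)]
           for sq_lab, sq_y in zip(labels, yields)]

# Treatment totals over the three squares, and the grand total G.
totals = {}
for sq in squares:
    for row in sq:
        for lab, y in row:
            totals[lab] = totals.get(lab, 0.0) + y
G = sum(totals.values())

def margin(line):
    """S - r * (line total): formula (47) applied to one row or one column."""
    return sum(totals[lab] for lab, _ in line) - r * sum(y for _, y in line)

L = [[margin(row) for row in sq] for sq in squares]
M = [[margin(col) for col in zip(*sq)] for sq in squares]
P = [sum(values) for values in L]                      # equals sum(M[t]) as well
T = [sum(y for row in sq for _, y in row) for sq in squares]

print(round(L[0][1], 1), round(M[0][0], 1))            # second L and first M of square 1
print([round(x, 1) for x in P], [round(G - r * t, 1) for t in T])
```

Because every variety occurs once in each square, the values of $L$ and of $M$ within a square must both total $G - rT$, which is why the last line prints the same three figures twice.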


2. Obtain sums of squares and set up analysis of variance as in Table
13-20. Note that the sums of squares for rows and columns eliminating
treatments are obtained from the quantities $L$ and $M$ as indicated in
formulas (48) and (49). Thus for rows eliminating treatments we have

\[ \tfrac{1}{30}(17.1^2 + 25.6^2 + \cdots + 24.1^2) - \tfrac{1}{150}(251.7^2 + 104.7^2 + 147.0^2) = 1019.26 \]

For columns

\[ \tfrac{1}{30}(39.2^2 + 53.1^2 + \cdots + 42.1^2) - \tfrac{1}{150}(251.7^2 + 104.7^2 + 147.0^2) = 314.34 \]
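
As a numerical check on these two sums of squares, the following sketch (not from the original text; the function name is illustrative) applies the divisors $rp(r - 1) = 30$ and $rp^2(r - 1) = 150$ to the values of $L$, $M$, and $P$ listed in Table 13-18.

```python
# Values of L (rows) and M (columns) for squares 1, 2, 3, and the three values of P.
L = [[17.1, 25.6, -93.7, -120.8, -79.9],
     [-38.1, 3.8, 20.1, 60.5, 58.4],
     [84.2, -16.7, 9.9, 45.5, 24.1]]
M = [[-39.2, -53.1, -40.3, -82.3, -36.8],
     [23.9, 11.0, 10.2, 22.7, 36.9],
     [89.3, -28.4, 34.5, 9.5, 42.1]]
P = [-251.7, 104.7, 147.0]
r, p = 3, 5

def ss_eliminating_treatments(values, P, r, p):
    """Sum of squares of the form used in formulas (48) and (49)."""
    flat = [v for square in values for v in square]
    return (sum(v * v for v in flat) / (r * p * (r - 1))
            - sum(x * x for x in P) / (r * p * p * (r - 1)))

ss_rows = ss_eliminating_treatments(L, P, r, p)
ss_columns = ss_eliminating_treatments(M, P, r, p)
print(round(ss_rows, 2), round(ss_columns, 2))
```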


TABLE 13-19

TREATMENT TOTALS, UNCORRECTED AND CORRECTED FOR
EXPERIMENT OF TABLE 13-18

Uncorrected
             o        n        c        s        m

  o        140.5    212.6    207.4    189.5    194.3
  w        132.2    218.9    199.7    212.2    213.0
  x        151.6    209.5    224.4    205.9    203.8
  y        126.2    207.9    221.9    177.5    219.7
  z        145.8    218.0    213.1    195.9    216.4
                                              4857.9 = G
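Corrected
             o        n        c        s        m

  o        139.4    215.6    202.6    190.4    199.4
  w        123.4    228.7    193.4    209.8    208.2
  x        148.2    218.3    211.2    207.2    212.8
  y        121.2    213.8    216.4    193.4    216.7
  z        143.5    222.7    209.6    197.3    214.7
                                              4857.9 = G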
The analysis of variance is given in Table 13-20.


TABLE 13-20

ANALYSIS OF VARIANCE OF 5 x 5 LATTICE SQUARE

                                        SS        DF       MS
Squares                               426.33       2
Rows (eliminating treatments)        1019.26      12     84.94 = E_r
Columns (eliminating treatments)      314.34      12     26.20 = E_c
Treatments                           7346.68      24
Error                                 405.04      24     16.88 = E_e
Total                                9511.65      74


3. Determine $\lambda$ and $\mu$ and correct treatment totals. From (50)

\[ \lambda = \frac{84.94 - 16.88}{2 \times 84.94} = 0.4006 \]

\[ \mu = \frac{26.20 - 16.88}{2 \times 26.20} = 0.1779 \]

For the correction of treatment totals we require

\[ \frac{\lambda}{p} = 0.0801, \qquad \frac{\mu}{p} = 0.0356 \]

To obtain the quantities $\alpha$ and $\beta$ set up in the margins of Table 13-18
we use formula (51). Thus the first value of $\alpha$ in square 1 is 17.1 x 0.0801
= 1.4, and the first value of $\beta$ in square 1 is -39.2 x 0.0356 = -1.4.

The corrections are then added to the corresponding treatment totals. 
Thus for sy the corrected total is 


177.5 + 1.4 - 1.4 + 4.7 + 1.3 + 6.7 + 3.2 = 193.4
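
Step 3 can be traced numerically as follows. This is a sketch under stated assumptions rather than the author's procedure in code: the helper name `weighting_factor` is hypothetical, and the factors are rounded to four decimals and the corrections to one decimal because the printed figures appear to have been rounded that way.

```python
E_r, E_c, E_e = 84.94, 26.20, 16.88   # mean squares from Table 13-20
r, p = 3, 5

def weighting_factor(e_block, e_error, r):
    """(E - E_e) / ((r - 1) E), taken as zero when E < E_e."""
    if e_block < e_error:
        return 0.0
    return (e_block - e_error) / ((r - 1) * e_block)

lam = weighting_factor(E_r, E_e, r)    # lambda, from the row mean square
mu = weighting_factor(E_c, E_e, r)     # mu, from the column mean square
lam_p = round(lam / p, 4)              # multiplier converting L into alpha
mu_p = round(mu / p, 4)                # multiplier converting M into beta

alpha_first = round(17.1 * lam_p, 1)   # first row of square 1
beta_first = round(-39.2 * mu_p, 1)    # first column of square 1

# Treatment sy lies in row 1, column 1 of square 1; row 5, column 5 of
# square 2; and row 1, column 1 of square 3.
corrections_sy = [1.4, -1.4, 4.7, 1.3, 6.7, 3.2]
corrected_sy = round(177.5 + sum(corrections_sy), 1)

print(round(lam, 4), round(mu, 4), lam_p, mu_p)
print(alpha_first, beta_first, corrected_sy)
```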


4. Variances of differences between treatment means are obtained
from (52).

\[ V_m = \frac{2 \times 16.88}{3}\left[1 + \tfrac{1}{2}(0.4006 + 0.1779)\right] = 11.253 \times 1.2892 = 14.507 \]

and

\[ SE_m = \sqrt{14.507} = 3.81 \]


For this experiment it is of interest to compare the results with those 
that would have been obtained if the design had been randomized blocks. 
As pointed out by Yates [19] lattice square experiments, like lattice 
experiments, can be analyzed as if they were laid out in randomized 
blocks. This gives the analysis shown in Table 13-21. 
TABLE 13-21

ANALYSIS OF LATTICE SQUARE AS A RANDOMIZED BLOCK EXPERIMENT

                  SS        DF       MS
Replicates      426.33       2
Treatments     7346.68      24
Error          1738.64      48     36.22
Total          9511.65      74


The variance of a mean difference between treatments is then

\[ \frac{2 \times 36.22}{3} = 24.15 \]

Comparing this with the variance of 14.51 obtained for the lattice square,
we have 24.15/14.51 = 1.66, showing that the gain in efficiency is 66%.
The number of replicates of a randomized block experiment required to
give the same accuracy as the lattice square is 1.66 x 3 = 5 approximately.
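
The comparison with randomized blocks reduces to a few lines of arithmetic. In the sketch below the bracket $1 + r(\lambda + \mu)/(p + 1)$ is an assumed reading of the general variance formula; for $r = 3$ and $p = 5$ its coefficient is $\tfrac{1}{2}$, which is what the worked figure $11.253 \times 1.2892$ requires. Variable names are illustrative.

```python
E_e = 16.88                  # lattice square error mean square
E_rb = 36.22                 # randomized block error mean square
lam, mu = 0.4006, 0.1779     # weighting factors from step 3
r, p = 3, 5

# Variance of a mean difference for the lattice square analysis
# (assumed general form; the coefficient r / (p + 1) is 1/2 here).
v_lattice = (2 * E_e / r) * (1 + r * (lam + mu) / (p + 1))
se_lattice = v_lattice ** 0.5

# The same quantity when the experiment is analyzed as randomized blocks.
v_blocks = 2 * E_rb / r

relative_efficiency = v_blocks / v_lattice
equivalent_replicates = relative_efficiency * r

print(round(v_lattice, 2), round(se_lattice, 2), round(v_blocks, 2))
print(round(relative_efficiency, 2), round(equivalent_replicates, 1))
```

The ratio of the two variances is the quantity that matters: it says how many randomized block replicates would be needed to match the precision of the 3 lattice square replicates.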

17. Balanced lattice squares with r = p + 1 when p is a prime number. 
The procedure for setting up a design of this type is very simple. We 
write out the $\tfrac{1}{2}(p + 1)$ squares as described in Section 15 and a second
set formed from the first by turning them at right angles. In other words 
the rows of a square in the first set become the columns of a square in the 
second set. Complete randomization then gives the actual field layout. 
For p = 4, 8, and 9 it is best to consult the tables cited above for actual 
designs. 

The analysis of the results can be followed from Table 13-22 which 
gives the results from a field test in diagrammatic form for 9 varieties in 
4 replications and shows the basic quantities that must be calculated. 

The outline of the analysis on which the calculation of the corrected 
variety yields is based is as follows. 


                                                   DF                    MS
1  Squares                                         p
2  Rows (eliminating varieties)                    p^2 - 1
3  Rows (eliminating varieties and columns)        p^2 - 1               E_r
4  Columns (eliminating varieties)                 p^2 - 1
5  Columns (eliminating varieties and rows)        p^2 - 1               E_c
6  Varieties (ignoring rows and columns)           p^2 - 1
7  Error                                           (p - 2)(p^2 - 1)      E_e
8  Total                                           p^2(p + 1) - 1


TABLE 13-22

DIAGRAMMATIC REPRESENTATION OF DATA FROM A BALANCED LATTICE SQUARE
EXPERIMENT WITH p + 1 REPLICATIONS. RESULTS ARE ASSUMED TO BE ENTERED
IN POSITIONS OCCUPIED BY VARIETY NUMBERS

[Diagram: four 3 x 3 squares of variety numbers 11 to 33, one for each
replicate, each with its row totals $R$ at the side and its column totals
$C$ at the foot. Below the squares, for each variety, are entered the
variety total $S(y_v)$ and the quantities $S(R_v)$, $S(C_v)$, $D_v$, $L_v$,
$J_v$, $K_v$, and $M_v$.]

$S(y_v)$ = variety totals
$S(R_v)$ = sum of all row totals containing variety $v$
$S(C_v)$ = sum of all column totals containing variety $v$
$D_v = S(R_v) - S(C_v)$
$L_v = pS(y_v) - (p + 1)S(R_v) + G$
$J_v = L_v + D_v$
$K_v = J_v + (p - 1)D_v$
$M_v = K_v + D_v$

In order to obtain the error sum of squares by subtraction we require
only the sums of squares 1, 2, 5, 6, and 8, or alternatively only 1, 3, 4, 6,
and 8, but 3 and 5 are required in order to obtain $E_r$ and $E_c$, so the logical
procedure is to calculate all the sums of squares shown, following the
alternative procedure to check the calculations. The following are the
formulas required to obtain the sums of squares for rows and columns.


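Rows (eliminating varieties):

\[ \frac{\Sigma(L_v^2)}{p^3(p + 1)} \]

Rows (eliminating varieties and columns):

\[ \frac{\Sigma(J_v^2)}{p^3(p - 1)} \]

Columns (eliminating varieties):

\[ \frac{\Sigma(M_v^2)}{p^3(p + 1)} \]

Columns (eliminating varieties and rows):

\[ \frac{\Sigma(K_v^2)}{p^3(p - 1)} \]
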
The remaining sums of squares are calculated in the usual manner.
After obtaining $E_r$, $E_c$, and $E_e$, we find


2 


58 


59 


Then 
Wi Wy 


pem Eu as 
Ww, + w + (p— Du, 


60 
Wi— We 


Wr + we + (p— Dw; 


== 


If $E_r < E_e$ we take $w_r = w_e$, and if $E_c < E_e$ we take $w_c = w_e$. The
adjusted variety totals are calculated from


n= 509 (4) 1. + (2) м, 61 


TI 
The variance of a mean difference between 2 varieties is


_ 2E (1 + 4+ ® 
MN р+1 


18. Loss of information due to inaccuracies of weighting. In all the 
lattice designs we have calculated the values of the weighting factors (w) 
from the data of the experiment. In the discussions of these designs by 
Yates [20] and Cochran [1] it is pointed out that the weighting factors are, 
strictly speaking, population parameters. In other words, if their true 
value were known and could be applied in the analyses, there would be no 
loss of information. Actually they are statistics calculated from samples, 
and consequently they are subject to variations due to random sampling. 
This variation introduces an additional element of variability into the 
corrected variety means, with a consequent loss of information. Yates 
[20] has presented data to show, however, that the loss of information is 
quite small and does not introduce sufficient error to reduce materially 
the efficiency of the lattice designs. Cochran [3] presents similar data 
for the partially balanced lattice squares. 

19. Other incomplete block designs. The lattice designs that are dealt 
with above can be employed only when the number of varieties or treat- 
ments to be tested is a perfect square. Other designs involving the 
principle of error control by means of incomplete blocks are possible; 
some of them are extremely useful, but for lack of space it is impossible 
to describe them in detail here. There is, for example, the balanced 
incomplete block design based on a number of varieties equal to $p^2 - p + 1$
and $r = p$. Thus for $p = 5$ we can make up a test for 21 varieties in 5
replicates, or for $p = 6$ we can design a test for 31 varieties in 6 replicates.
In this design the varieties cannot be arranged in groups in the field 
corresponding to complete replicates. This condition is undesirable 
because the investigator frequently wishes to check over all the varieties 
for some characteristic in 1 or 2 replicates and does not wish to cover the 
whole experiment. If the design is of the type where the number of 
varieties = $p^2 - p + 1$, such a check is practically impossible. Not
having distinct replicates it is also impossible to revert to the randomized 
block method of analysis when the block variance proves to be equal to 
or less than the error variance. 

Other incomplete block designs similar to the one described above are 
possible for various combinations of variety numbers and replicates. 
Descriptions of these will be found in the references at the end of this 
chapter and particularly in Cochran and Cox [4].

The cubic lattice is a particularly useful design for experiments in which
the number of varieties exceeds 121. It is based on the fact that $p^3$
numbers can be set up in the form of a cube with $p$ numbers along each
side. The $p^3$ varieties may then be divided into $p^2$ groups of $p$ each and
arranged to occur correspondingly in blocks of $p$ plots each in the field.
For a complete description of this design and method of analysis, see
Yates [19], Phipps et al. [16], and Cochran and Cox [4].

Other incomplete block designs can be made up from sets of numbers 
that can be arranged in a rectangle of p x q dimensions. These were 
first described by Yates [17] in 1936 but were not employed extensively 
on account of the somewhat complex analytical procedure. More 
recently another version of this design has been described by Harshbarger 
[12, 13, 14]. These designs are extremely valuable because they allow 
for experiments with variety numbers other than those possible for the 
square lattices and lattice squares. It is possible, for example, to employ
Harshbarger's designs for $v = p(p + 1)$. Thus we can have 4 x 5 = 20,
5 x 6 = 30, 6 x 7 = 42, and so forth.

20. Missing values in incomplete block experiments. Methods of esti- 
mating missing values and making corresponding corrections in the 
analysis have been described by Cornish [5]. If one is satisfied with an 
approximate value, it is sufficient to treat the experiment as if it had been 
laid out in randomized blocks and apply the formula given in Chapter 11, 
Section 5. In many cases this will be adequate, but if the missing value 
occurs in a low-yielding block it will be slightly overestimated, and conversely
it will be underestimated if it occurs in a high-yielding block.


21. Exercises. 


1. Table 13-23 gives the yields of oats in bushels per acre in a simple lattice
experiment. Carry the analysis through to step 3 of Example 13-2, and come to a
conclusion as to how the analysis should proceed from that point. If in doubt work
through steps 4 and 5 and note the size of the correction to the variety totals.

2. The yields for a 5 x 5 lattice square experiment are given in Table 13-24. This
experiment is made up from uniformity data; therefore it is not expected that the
variety variance will be significant. The yields are given in the table in the same
position as the plots in the field. In order to reduce the work in this exercise, row,
column, and replicate totals are given, and the total sum of squares before taking away
the correction term is 480,392.00. The following values for variety 1 will be of
assistance in checking the methods.


$S(y_1) = 328$. $S(R_1) = 1670$. $S(C_1) = 1717$. $D_1 = -47$. $L_1 = -2$. $J_1 = -49$.
$M_1 = -284$. $K_1 = -237$. $\alpha_1 = -0.08$. $\beta_1 = -12.95$. Adjusted total = 315.0.


3. Design square lattice or lattice square experiments for the following numbers of 
varieties, giving reasons for the particular designs selected: у = 25, 36, 49, 64, 81, 100. 


TABLE 13-23

YIELDS IN A SIMPLE LATTICE EXPERIMENT WITH 25 VARIETIES
AND 4 REPLICATIONS

   Replicate 1         Replicate 2         Replicate 3         Replicate 4
     Group X             Group Y             Group X             Group Y
 Variety*  Yield     Variety   Yield     Variety   Yield     Variety   Yield

    11     50.8         11     55.4         41     88.6         54     62.0
    12     88.2         21     88.4         44     61.3         44     53.8
    13     50.3         31     75.4         45     71.4         24     58.4
    14     73.9         41     94.8         43     48.1         34     52.4
    15     49.2         51     32.1         42     52.1         14     70.6

    21     66.5         12     67.7         14     81.6         43     67.1
    22     41.6         22     50.6         12     76.2         53     49.5
    23     76.6         32     68.1         15     72.3         33     61.4
    24     45.3         42     48.5         13     59.6         23     94.1
    25     83.8         52     70.5         11     48.8         13     80.3

    31     69.4         13     60.9         54     52.9         31     68.3
    32     71.0         23     66.8         52     72.6         21     60.1
    33     59.1         33     59.5         51     50.3         41     65.1
    34     39.7         43     54.3         55     88.6         11     61.8
    35     64.2         53     73.3         53     76.6         51     52.7

    41     82.8         14     58.1         24     78.3         32     44.9
    42     50.6         24     60.0         22     40.7         22     53.6
    43     67.3         34     49.4         25     89.9         42     59.1
    44     51.3         44     79.0         21     85.7         52     52.3
    45     82.1         54     46.3         23     85.5         12     60.9

    51     68.1         15     49.6         31     67.5         45     68.6
    52     72.3         25     62.7         34     40.0         35     61.8
    53     65.7         35     72.2         32     64.8         55     65.4
    54     46.7         45     52.7         35     58.9         25     97.6
    55     69.7         55     69.7         33     57.9         15     58.0

* Varieties shown in systematic order in replicates 1 and 2 but in the actual
experiment blocks and varieties were fully randomized. This was a test of oat
varieties conducted at Ottawa, Ontario, in 1942.


TABLE 13-24 


YIELDS IN A 5 x 5 LATTICE SQUARE EXPERIMENT WITH 6 REPLICATIONS


Replicate 1 Total Replicate 2 Total 
18519 17 720 46 ДОЛОО e8 
62 60 53 44 45, 264 63 61 54 46 46) 270 
ST 455,500 a 34 ел ҮЗЕ CK25 0 23] 
ӨШ 522 561051 139 | 24 59 55 58 50 46 | 268 
1282401722025 IY i 4 1 229 е) 
59 62 60 44 43 26% 64 63 66 55 52!) 300 
8 9 7 40 6 9 6 7810 8 
64 58 63 48 48| 281 70 66 59 49 46| 290 
СИЕЗ Чор 152 11 Ж Ша М2, 13: 213 
69 “65 .69 55. 51| 309 ЭТ (522 60: 546 41] 257 
Total 315 297 306 242 226 | 1386 Total 313 298 297 246 231 | 1385 
Replicate 3 Total Replicate 4 Total 
2% 217290" "gi 04 9 4 19 2 14 
БВ OSS 271 45. 74212252 64 65 49 43 48| 269 
24, М8: еэ 6 87 3, \Ы8@ 23. ЧЗ 
ПКВ O SO #53' | 310. 62 63 49 44 45 263 
ON 222 МАКА? 6 Б (immer? ӨМ 
SS U1 8 590 5810341 МӘ 2622-55) 2452 21) 289 
He NOE) DAS ТОЗА (37 200 952745 
66 63 62 9 51| 291 76 67 61 54 53) 311 
132127 М 192 25 IA AULD (aR 42 
61 58 50 49 50) 268 63 64 49 44 44| 264 
Total 330 315 311 252 254 | 1462 Total 340 322 263 230 241 | 1396 
Replicate 5 Total Replicate 6 Total 
D 595 764 220 719 ӘП 157474 15 
62 54 9 4 44! 250 69 644 59 51 47| 290 
LY ee DVIS © v 221 CS EH, 492722 
67 67 56 48 44| 282 73 73 71 56 46) 319 
Ша 7827 222 За 10 20-120 23. 1 9 
65 63 54 46 46| 274 62 63 59 48 43) 275 
24:25 242 I6. 12 dd. 76,717 28703 
ӘЛІ 2637 152! 45° 417): 248 62 60 53 45 42 262 
24 ЕМЕ 205070 73 2 24 10 13 16 
57 49 54 48 9) 247 66 68 68 52 48% 302 
Total 308 286 265 228 214 | 1301 Total 332 328 310 252 226 1448 


* Italics are variety numbers. 
CHAPTER 14


The Treatment of 
Non-Orthogonal Data 


1. The meaning of non-orthogonality. Ordinarily an experiment is so 
designed that the various effects to be measured are mutually orthogonal. 
This means that they are not entangled in any way and the means are 
direct estimates of the effects required. The data from a simple random- 
ized block experiment may be represented as follows where there are 6 
treatments and 4 replicates. 

                              Treatment                     Replicate
                     A     B     C     D     E     F          Total

               I                                               R_1
Replicate     II                                               R_2
             III                                               R_3
              IV                                               R_4

Treatment
Total               C_1   C_2   C_3   C_4   C_5   C_6           G


It follows that, if the data are complete, any differences in the yields 
of the replicates will not affect the differences between means of the 
treatments. Similarly the differences between replicates will not be 
affected by treatments. In this instance the replicate and treatment 
effects are correctly referred to as orthogonal. Conversely, if there is any 
variation from the conditions as laid down, the treatment and replicate 
effects will be non-orthogonal. For example, if one or more of the plots
are missing, treatment means will then not be independent of replicate 
means, and vice versa. If a plot of treatment D, for example, is missing 
in a high-yielding replicate, the mean of the remaining plots of D will be 
lower than it should be in comparison with the other treatments. It is 
obvious therefore that variation in the replicate means will be reflected 
by variation in the treatment means. In other words, a variance calcu- 
lated for the treatments will be due partly to treatment effects and partly 
to variation in the replicates. Thus the two effects are entangled, and 
special methods must be adopted in order to obtain a variance for the 
treatments that is corrected for replicate effects, before tests of significance
can be applied. 

2. The effect of missing values. The effect of missing values on
orthogonality can be demonstrated very simply in a 2 x 2 table. In a table
such as the one given below we can take $-a$ and $a$ to represent the
response to A, and $-b$ and $b$ for the response to B. Error terms are
neglected because they do not make any difference to the general
conclusion.

                       B                   A          A
                 0            1          Total      Mean
      0      -(a + b)     -(a - b)        -2a        -a
A
      1       (a - b)      (a + b)         2a         a
B Total         -2b           2b            0
B Mean          -b            b

From the marginal totals and means of this table it is clear that the
responses of A and B are independent. Suppose, however, that the value
representing the combination $A_0B_1$ is missing. We then have the situation
illustrated in the next 2 x 2 table. The differences between the
unweighted means are $2a + b$ for A and $a + 2b$ for B, showing very clearly
that the two responses are not independent.

                       B                   A          A
                 0            1          Total      Mean
      0      -(a + b)        --        -(a + b)   -(a + b)
A
      1       (a - b)      (a + b)         2a         a
B Total         -2b         a + b        a - b
B Mean          -b          a + b


Another very striking feature of the non-orthogonal table is the failure
of the addition theorem for sums of squares and degrees of freedom. In
the first table it is easy to show that the sums of squares and degrees of
freedom for A, B, and AB are

              SS                  DF
A            4a^2                 1
B            4b^2                 1
AB           0                    1
Total        4(a^2 + b^2)         3

If the total sum of squares is determined directly, we find it to be
$4(a^2 + b^2)$, proving that the addition theorem holds. In the second
table we have

              SS                        DF
A            (2/3)(2a + b)^2            1
B            (2/3)(a + 2b)^2            1
AB           (2/3)(a - b)^2             1
Total        4(a^2 + ab + b^2)          3

where each of the component sums of squares is calculated directly, and
these are added to obtain the total. Then on determining the sum of
squares for the total directly it is found to be $\tfrac{8}{3}(a^2 + ab + b^2)$. This is
less than that obtained by summation for the 3 effects by an amount equal
to $\tfrac{4}{3}(a^2 + ab + b^2)$. Note also that there are only 3 determinations in
the table so that it represents only 2 degrees of freedom, whereas on
assuming 1 degree of freedom for each effect the total should be 3. This
result for degrees of freedom agrees with that obtained for the sums of
squares.
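
The failure of the addition theorem can be confirmed with exact arithmetic. The sketch below is an illustration only: the values $a = 2$, $b = 1$ are arbitrary, the helper name is hypothetical, and the AB term is taken from the expression given in the text rather than computed independently.

```python
from fractions import Fraction as F

a, b = 2, 1   # arbitrary illustrative responses

# The 2 x 2 table with the A_0 B_1 cell missing; keys are (level of A, level of B).
cells = {(0, 0): -(a + b), (1, 0): a - b, (1, 1): a + b}
n = len(cells)
G = sum(cells.values())

def ss_between(factor_index):
    """Sum of (group total)^2 / (group size) minus the correction term G^2 / n."""
    ss = F(0)
    for level in (0, 1):
        group = [v for key, v in cells.items() if key[factor_index] == level]
        ss += F(sum(group)) ** 2 / len(group)
    return ss - F(G) ** 2 / n

ss_A = ss_between(0)
ss_B = ss_between(1)
ss_AB = F(2, 3) * (a - b) ** 2                      # expression quoted in the text
ss_by_summation = ss_A + ss_B + ss_AB
ss_direct = sum(F(v) ** 2 for v in cells.values()) - F(G) ** 2 / n

print(ss_A, ss_B, ss_AB, ss_by_summation, ss_direct)
```

With these values the three components add to 28, whereas the total computed directly from the 3 observations is 56/3, so the components overstate the total by 28/3.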

3. Treatment of a non-orthogonal 2 x 2 table. Having noted that the 
ordinary methods of analysis will not apply to non-orthogonal data, it 
remains to decide if anything can be done to obtain the maximum amount 
of information from the results. It is obviously unreasonable to discard 
the data as they may contain valuable information. 

The 2 x 2 table is examined first because it is essentially the simplest
case. Suppose that we have a 2 x 2 table as follows:

                   B
              0         1         Total
A     0       X         --          X
      1       Y         Z         Y + Z
Total       X + Y       Z       X + Y + Z

or if we prefer to think in terms of figures we can take

                   B
              0         1         Total
A     0       4         --          4
      1       8         7          15
Total        12         7          19

The first step is to decide how to insert an estimate of the missing value
that will make the means comparable, i.e., without entanglement with the
other factor. This can be done very simply by making the missing value
$(X + Z - Y)$, or 11 - 8 = 3 in the numerical table. The basis for this
is that we have selected a value that will make the interaction zero. We
then have a set of completed values as distinguished from the original
values. The two tables are

                   B
              0           1             Total
A     0       X       X + Z - Y      2X + Z - Y
      1       Y           Z            Y + Z
Total       X + Y    X + 2Z - Y      2(X + Z)

                   B
              0           1             Total
A     0       4           3               7
      1       8           7              15
Total        12          10              22

The means for A are $(2X + Z - Y)/2$ and $(Y + Z)/2$, and the difference
is

\[ \frac{Y + Z}{2} - \frac{2X + Z - Y}{2} = Y - X \]

Similarly the means for B are $(X + 2Z - Y)/2$ and $(X + Y)/2$ and the
difference is

\[ \frac{X + 2Z - Y}{2} - \frac{X + Y}{2} = Z - Y \]


These differences are obviously free from any effect of one factor on the
other. On examining the table it is clear that the method is logical.
For example, the response to A can be measured only for the first level 
of B. The second level of B does not give a comparison, and consequently 
the observation represented by Z must be discarded. 

The second step is to decide what tests of significance, if any, can be 
made. This is slightly more difficult as it involves the underlying theory 
of the analysis of variance. We note that since there are only 3 deter- 
minations there are only 2 degrees of freedom, and if 1 is used for an 
estimate of the variance for А, and 1 for В, there are no degrees of freedom 
left for an estimate of error. Since there is no estimate of error, this 
problem does not fall into the class for which a general solution can be 
found to the analytical procedure.

4. Missing values in a 2 x c table. In referring to tables of various
dimensions we shall adopt a standard notation r x c, where r refers to
the number of rows and c the columns. Let us assume that the table is
as follows, representing the data by means of both algebraic and numerical
symbols.

                       B
              0       1       2       3
A     0     X_00    X_01    X_02     --       R_1
      1     X_10    X_11    X_12    X_13      R_2
            C_0     C_1     C_2     C_3       G

                       B
              0       1       2       3
A     0       4       6       8      --       18
      1       8      10       6       6       30
             12      16      14       6       48


The missing value is the one representing the result for the treatment
combination $A_0B_3$, but the methods of analysis described below are quite
general and can be applied without difficulty when any value is missing. 
In the first place it is instructive to apply methods that may be described 
as intuitive. This is important in that the application of these methods 
gives a clear idea of what happens when other methods are applied 
that otherwise may seem to be merely the manipulation of formulas 
and rules. 

In the first place a decision must be made on what tests of significance 
can be applied, and this involves the calculation of mean squares for A 
and B that are free of entanglement, and a suitable error. The response 
to A can be obtained from the values that are present in both rows. 
Therefore, we take

\[ T_1 = X_{00} + X_{01} + X_{02}, \qquad T_2 = X_{10} + X_{11} + X_{12} \]

and the sum of squares for A = $(T_1 - T_2)^2/6 = (18 - 24)^2/6 = 6.0$. The
sum of squares for B is slightly more difficult to calculate, but the method
follows easily when we note that the 3 degrees of freedom for the effect
can be split up into 1 for the comparison $(X_{10} + X_{11} + X_{12} - 3X_{13}) =
T_2 - 3X_{13}$ and 2 degrees of freedom for variation among the 3 totals
$C_0$, $C_1$, and $C_2$. The sums of squares are therefore

1 DF:

\[ \frac{(T_2 - 3X_{13})^2}{12} = \frac{(24 - 18)^2}{12} = 3.0 \]

2 DF:

\[ \frac{C_0^2 + C_1^2 + C_2^2}{2} - \frac{(C_0 + C_1 + C_2)^2}{6} = \frac{12^2 + 16^2 + 14^2}{2} - \frac{42^2}{6} = 4.0 \]

Total for B (3 DF) = 3.0 + 4.0 = 7.0
The next step is to obtain a sum of squares for error. An obvious
procedure is to determine the sum of squares within levels of A and to
subtract the portion that can be attributed to B. This gives

\[ \text{Within } A_0: \quad 4^2 + 6^2 + 8^2 - \frac{18^2}{3} = 8.0 \]

\[ \text{Within } A_1: \quad 8^2 + 10^2 + 6^2 + 6^2 - \frac{30^2}{4} = 11.0 \]

Total within A = 19.0
             B =  7.0
         Error = 12.0

It is now possible to set up the analysis of variance. 


             SS       DF      MS       F

B            7.0      3       2.3
A            6.0      1       6.0      1.0
Error       12.0      2       6.0

Total       25.0      6


The failure of the additive property for non-orthogonal data is again 
demonstrated by this example. The total sum of squares determined 
directly from the 7 variate values is 22.86, which does not agree with that 
obtained by totaling the sums of squares obtained for A, B, and the error. 
The sums of squares for A and B may be regarded as adjusted sums of
squares. There is an agreement with the total sum of squares calculated
directly if the components are set up in either of the following ways.

                     SS                             SS
B (unadjusted)       4.86      B (adjusted)         7.00
A (adjusted)         6.00      A (unadjusted)       3.86
Error               12.00      Error               12.00
Total               22.86      Total               22.86


This fact will be made use of in further developments. If, for example, 
the error sum of squares can be calculated directly, the adjusted sums of 
squares can be obtained by subtraction. 
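
The figures of this section can be reproduced with the sketch below, which follows the intuitive method step by step. It is an illustration rather than a general routine: the divisors 6, 12, 2, and 6 are specific to this 2 x 4 table with the $A_0B_3$ cell missing, and the variable names are not from the text.

```python
# Observed values; the A_0 B_3 cell is missing.
a0 = [4, 6, 8]
a1 = [8, 10, 6, 6]

def corrected_ss(values):
    """Sum of squares of the values about their own mean."""
    return sum(v * v for v in values) - sum(values) ** 2 / len(values)

t1, t2 = sum(a0), sum(a1[:3])               # totals over the columns present in both rows
cols = [x + y for x, y in zip(a0, a1)]      # C_0, C_1, C_2 (zip stops after 3 pairs)
g, n = sum(a0) + sum(a1), len(a0) + len(a1)

# Adjusted sums of squares, as in the text.
ss_a_adj = (t1 - t2) ** 2 / 6
ss_b_adj = ((t2 - 3 * a1[3]) ** 2 / 12
            + sum(c * c for c in cols) / 2 - sum(cols) ** 2 / 6)
ss_error = corrected_ss(a0) + corrected_ss(a1) - ss_b_adj
ss_total = corrected_ss(a0 + a1)

# Unadjusted sums of squares from the marginal totals.
ss_a_unadj = sum(a0) ** 2 / 3 + sum(a1) ** 2 / 4 - g * g / n
ss_b_unadj = sum(c * c for c in cols) / 2 + a1[3] ** 2 / 1 - g * g / n

print(ss_a_adj, ss_b_adj, ss_error, round(ss_total, 2))
print(round(ss_b_unadj + ss_a_adj + ss_error, 2),
      round(ss_b_adj + ss_a_unadj + ss_error, 2))
```

Adding both adjusted sums of squares to the error gives 25.0, while either mixed arrangement (one effect adjusted, the other unadjusted) reproduces the direct total of 22.86.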

It now remains to obtain means for the rows and columns of the table 
that are directly comparable. The neatest method is to calculate a value 
for the missing variate which, if inserted in the table, will accomplish the 
desired result. This is done by reverting to the method illustrated for a
2 x 2 table except that now the table is made up from

\[ \bar{X}_0 = \frac{X_{00} + X_{01} + X_{02}}{3}, \qquad \bar{X}_1 = \frac{X_{10} + X_{11} + X_{12}}{3} \]

and $X_{13}$ from column 3 of the original table. This provides a 2 x 2
table with one missing value,

        X̄_0      --
        X̄_1     X_13

and the estimate of the missing value that will make the interaction zero
is $\bar{X}_0 + X_{13} - \bar{X}_1$. Using figures, we have the 2 x 2 table

         6       4
         8       6

where 4 is the calculated value of the missing variate. The completed
table is now

                       B
              0       1       2       3
A     0     X_00    X_01    X_02    X_03      T_1 + X_03
      1     X_10    X_11    X_12    X_13      T_2 + X_13
            C_0     C_1     C_2     C_3       G

                       B
              0       1       2       3
A     0       4       6       8       4       22
      1       8      10       6       6       30
             12      16      14      10       52

The A means are $(T_1 + X_{03})/4$ and $(T_2 + X_{13})/4$, and their difference is

\[ \frac{T_2 - T_1}{4} + \frac{X_{13} - X_{03}}{4} = \frac{24 - 18}{4} + \frac{6 - 4}{4} = \frac{6 + 2}{4} = 2 \]

This is equal to $(T_2 - T_1)/3$, because $X_{03}$ has been selected so that
$X_{13} - X_{03} = (T_2 - T_1)/3$. It can be shown similarly that changes in the
A response will not affect the means for B.
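
A short sketch of the missing-value calculation follows. It is illustrative only: the function name `estimate_missing` is hypothetical, and the shift of 5 units applied to the $A_1$ row is an arbitrary device for showing that a change in the A response leaves the differences among the B means untouched.

```python
def estimate_missing(a0, a1):
    """Value for the missing A_0 B_3 cell that makes the interaction zero.

    a0 holds the 3 observed values of the first row, a1 the 4 values of the second.
    """
    mean0 = sum(a0) / 3
    mean1 = sum(a1[:3]) / 3
    return mean0 + a1[3] - mean1

a0, a1 = [4, 6, 8], [8, 10, 6, 6]
x03 = estimate_missing(a0, a1)

# Difference between the A means of the completed table, and the same
# difference measured on the 3 columns present in both rows.
diff_completed = sum(a1) / 4 - (sum(a0) + x03) / 4
diff_common = (sum(a1[:3]) - sum(a0)) / 3

def b_mean_differences(a0, a1):
    """Differences of the B (column) means from the first column mean, after completion."""
    row0 = a0 + [estimate_missing(a0, a1)]
    means = [(u + v) / 2 for u, v in zip(row0, a1)]
    return [m - means[0] for m in means]

shifted_a1 = [v + 5 for v in a1]            # a larger response to A
print(x03, diff_completed, diff_common)
print(b_mean_differences(a0, a1), b_mean_differences(a0, shifted_a1))
```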

5. Treatment of non-orthogonal data by the method of fitting constants. 
The procedure outlined in Section 4 for a 2 x c table with 1 missing 
value would be rather difficult to apply with more than 2 rows in the 
table and especially with more than 1 missing value. It is therefore 
desirable to have a general method that can be adapted to the analysis of 
any non-orthogonal example. Yates [6, 7] has presented for this purpose
the method of fitting constants, and it is applied here to the simple 2 x 4
table discussed in Section 4.

Yates has pointed out that the whole procedure of the analysis of
variance can be regarded as a process of fitting constants for the various
effects. Thus, for a 2 x 4 table, as follows.

                       B
              0       1       2       3       Total    Mean
A     0     X_00    X_01    X_02    X_03      R_0      r̄_0
      1     X_10    X_11    X_12    X_13      R_1      r̄_1
Total       C_0     C_1     C_2     C_3       G
Mean        c̄_0     c̄_1     c̄_2     c̄_3       x̄


Each variate $X$ can be regarded as made up of four parts:

\[ a + b + m + z \]

where $a$ represents the effect of the level of A, $b$ the effect of the level of B,
$m$ the mean of the table, and $z$ the random variation to be sorted out as
experimental error. For $a$ there will be 2 values $a_0$ and $a_1$, and for $b$
there will be 4 values, $b_0$, $b_1$, $b_2$, and $b_3$. The obvious method of finding
the values of $a$ and $b$ is by the method of least squares, i.e., by setting up
an equation

\[ Y = a + b + m \tag{1} \]

and minimizing the sum of squares $\Sigma(X - a - b - m)^2$. This leads to a
set of simultaneous equations that can be written down by rule. This
rule can be observed from the equations for the above 2 x 4 table, which
are written below.

\[ \begin{aligned}
G   &= 4a_0 + 4a_1 + 2b_0 + 2b_1 + 2b_2 + 2b_3 + 8m \\
R_0 &= 4a_0 + b_0 + b_1 + b_2 + b_3 + 4m \\
R_1 &= 4a_1 + b_0 + b_1 + b_2 + b_3 + 4m \\
C_0 &= a_0 + a_1 + 2b_0 + 2m \\
C_1 &= a_0 + a_1 + 2b_1 + 2m \\
C_2 &= a_0 + a_1 + 2b_2 + 2m \\
C_3 &= a_0 + a_1 + 2b_3 + 2m
\end{aligned} \tag{2} \]


The rule arises from writing down for any total such as $C_0$ the $a$, $b$, and
$m$ components involved.
Since the constant $m$ has been included, it is permissible to define
$\Sigma(a) = 0$ and $\Sigma(b) = 0$, and the quantities $a$ and $b$ will then represent
plus and minus deviations from $m$. The above equations then solve as
follows.

\[ G = 8m, \qquad m = \frac{G}{8} = \bar{x} \]

\[ R_0 = 4a_0 + 4m, \qquad a_0 + m = \frac{R_0}{4}, \qquad a_0 = \bar{r}_0 - m \]

Similarly

\[ a_1 = \bar{r}_1 - m \]

Also

\[ C_0 = 2b_0 + 2m, \qquad b_0 + m = \frac{C_0}{2}, \qquad b_0 = \bar{c}_0 - m \]

Similarly

\[ b_1 = \bar{c}_1 - m, \qquad b_2 = \bar{c}_2 - m, \qquad b_3 = \bar{c}_3 - m \]


It will be remembered from Chapter 8, Section 5, that the sum of
squares representing the fitting of the partial regression coefficients was
given by
\[ SS = b_1 \sum(x_1 y) + b_2 \sum(x_2 y) + b_3 \sum(x_3 y) + \cdots \tag{3} \]
Similarly the sum of squares for the fitting of the above constants is
given by
\[ SS = b_0C_0 + b_1C_1 + b_2C_2 + b_3C_3 + a_0R_0 + a_1R_1 + mG \tag{4} \]
\[ = \bar{c}_0C_0 + \bar{c}_1C_1 + \bar{c}_2C_2 + \bar{c}_3C_3 + \bar{r}_0R_0 + \bar{r}_1R_1 - \bar{x}G \tag{5} \]
\[ = \frac{C_0^2 + C_1^2 + C_2^2 + C_3^2}{2} + \frac{R_0^2 + R_1^2}{4} - \frac{G^2}{8} \tag{6} \]

Now in the analysis of variance
\[ \frac{C_0^2 + C_1^2 + C_2^2 + C_3^2}{2} - \frac{G^2}{8} \]
is the sum of squares for columns and
\[ \frac{R_0^2 + R_1^2}{4} - \frac{G^2}{8} \]
is the sum of squares for rows. Therefore the total sum of squares less
the sum of squares in (6) gives the sum of squares for error. Thus the
sum of squares in (6) is the reduction in the total due to fitting the
constants.

The essential difference between orthogonal and non-orthogonal data 
lies in the fact that for orthogonal data the fitted constants are equal to 
the means of the rows and columns and the general mean, whereas for 
non-orthogonal data the constants and the means do not necessarily 
correspond. Basically the fitting of constants is a procedure enabling us
to set up a set of adjusted values, Y, corresponding to the original values
such that in the adjusted values the interaction is zero. This applies to
both orthogonal and non-orthogonal data. For the above example

\[ Y_{00} = m + a_0 + b_0 = \bar{r}_0 + \bar{c}_0 - m, \qquad Y_{01} = m + a_0 + b_1 = \bar{r}_0 + \bar{c}_1 - m, \qquad \text{etc.} \]

and it can be seen that for such a set of values the additive property is
artificially imposed and therefore that the interaction will be nil. It is
for this reason that in actual data the sum of squares of the differences
$\sum (Y - X)^2$ will be the sum of squares representing interaction because it
measures the extent of the discrepancy with the additive property. In
the case where the rows are replicates and the columns treatments it is
clear that $\sum (Y - X)^2$ will be the error sum of squares.

The above procedure will now be applied to a 2 x c table in which
there is a missing value. If the variate $X_{03}$ is missing we will have the
table:


                      B
             0      1      2      3      Total
 A    0     X_00   X_01   X_02    -      R_0
      1     X_10   X_11   X_12   X_13    R_1
 Total      C_0    C_1    C_2    C_3     G


and the simultaneous equations are
\[
\begin{aligned}
G   &= 3a_0 + 4a_1 + 2b_0 + 2b_1 + 2b_2 + b_3 + 7m \\
R_0 &= 3a_0 + b_0 + b_1 + b_2 + 3m \\
R_1 &= 4a_1 + b_0 + b_1 + b_2 + b_3 + 4m \\
C_0 &= a_0 + a_1 + 2b_0 + 2m \\
C_1 &= a_0 + a_1 + 2b_1 + 2m \\
C_2 &= a_0 + a_1 + 2b_2 + 2m \\
C_3 &= a_1 + b_3 + m
\end{aligned}
\tag{7}
\]
and, from $\sum(a) = 0$ and $\sum(b) = 0$, these simplify to the following. The
first equation can be omitted because it is not independent of the
others and therefore is not required in the solution.
\[
\begin{aligned}
R_0 &= 3a_0 + b_0 + b_1 + b_2 + 3m \\
R_1 &= 4a_1 + 4m \\
C_0 &= 2b_0 + 2m \\
C_1 &= 2b_1 + 2m \\
C_2 &= 2b_2 + 2m \\
C_3 &= a_1 + b_3 + m
\end{aligned}
\tag{8}
\]
By simplification and substitution in the above equations we have finally
\[ m = \frac{2R_0 + R_1 + 2C_3}{12} \tag{9} \]
\[ a_0 = \frac{R_0}{3} - \frac{C_0 + C_1 + C_2}{6}, \qquad a_1 = -a_0 \tag{10} \]
\[ b_0 = \frac{C_0}{2} - m, \qquad b_1 = \frac{C_1}{2} - m, \qquad b_2 = \frac{C_2}{2} - m, \qquad b_3 = C_3 - \frac{R_1}{4} \tag{11} \]

These formulas are easily applied to the numerical example of Section 4,
thus:
\[ m = \frac{36 + 30 + 12}{12} = 6.5 \]
\[ a_0 = \frac{18}{3} - \frac{42}{6} = -1.0, \qquad a_1 = +1.0 \]
\[ b_0 = \frac{12}{2} - 6.5 = -0.5 \]
\[ b_1 = \frac{16}{2} - 6.5 = 1.5 \]
\[ b_2 = \frac{14}{2} - 6.5 = 0.5 \]
\[ b_3 = 6 - \frac{30}{4} = -1.5 \]

As a check we note that $\sum(a) = 0$ and $\sum(b) = 0$.

The adjusted values for the table can now be obtained. For example,
$Y_{00} = -1.0 - 0.5 + 6.5 = 5.0$. The completed table of adjusted values
is as follows, with the fitted constants and the original totals in the margins.


                      B
             0      1      2      3    |    a       R
 A    0     5.0    7.0    6.0    4.0   |  -1.0     18
      1     7.0    9.0    8.0    6.0   |  +1.0     30
 b         -0.5    1.5    0.5   -1.5   |  6.5 = m
 C         12     16     14      6     |  48 = G


It should be noted at this point that the adjusted value for the missing
variate is the same as that obtained in Section 4.
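
The fitted constants and the adjusted values can be checked with a few lines of Python. This is only a sketch: the cell values are taken to be those of the Section 4 example (an assumption here, since Section 4 is not reproduced), the variable names are illustrative, and exact fractions are used so that the results can be compared directly with equations 9 to 11.

```python
from fractions import Fraction as F

# Assumed original values of the 2 x 4 example; None marks the missing X03.
table = [
    [4, 6, 8, None],
    [8, 10, 6, 6],
]

# Row and column totals of the values actually present.
R = [sum(v for v in row if v is not None) for row in table]
C = [sum(row[j] for row in table if row[j] is not None) for j in range(4)]
G = sum(R)

# Equations (9) to (11) for a 2 x 4 table with X03 missing.
m = F(2 * R[0] + R[1] + 2 * C[3], 12)
a0 = F(R[0], 3) - F(C[0] + C[1] + C[2], 6)
a = [a0, -a0]
b = [F(C[j], 2) - m for j in range(3)] + [C[3] - F(R[1], 4)]

# Adjusted values Y = a + b + m for every cell, including the missing one.
Y = [[a[i] + b[j] + m for j in range(4)] for i in range(2)]

print("m =", float(m))
print("a =", [float(v) for v in a])
print("b =", [float(v) for v in b])
print("Y =", [[float(v) for v in row] for row in Y])
```

The entry `Y[0][3]` is the adjusted value standing in for the missing variate.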

The next step is to calculate the sum of squares for the fitted constants.
This is given most easily by equation 4. Here we have


SS (constants) = - 0.5 x 12 + 1.5 x 16 + 0.5 x 14 - 1.5 x 6
                 - 1.0 x 18 + 1.0 x 30 + 6.5 x 48 = 340.00


This includes the correction for the mean which is $48^2/7 = 329.14$.
Therefore the total for the constants is 340.00 - 329.14 = 10.86. The
meaning of this sum of squares is perhaps made more obvious by calculating
it directly from the adjusted values, omitting the one substituted
for the missing variate. Thus
\[ \sum(Y^2) - \frac{48^2}{7} = 10.86 \]

We are now in a position to determine the error sum of squares, or if
A and B are two factors in an experiment it will be the sum of squares
for their interaction. This is the total sum of squares for the original
values less the sum for the fitted constants. This is


22.86 - 10.86 = 12.00


which agrees exactly with that obtained in the previous section. The
meaning of this sum of squares can be illustrated by calculating it directly
from the differences between the original and the adjusted values. These
differences are


                   X - Y
     -1.0    -1.0    +2.0      -
     +1.0    +1.0    -2.0     0.0


and $\sum(X - Y)^2 = 12.00$.

The adjusted sums of squares are now very easily obtained from the
identities given below together with the actual figures.

Total (original values) - A (unadjusted) - Error = B (adjusted)
         22.86          -      3.86      - 12.00 =     7.00
Total (original values) - B (unadjusted) - Error = A (adjusted)
         22.86          -      4.86      - 12.00 =     6.00


It is even simpler to take

Total (constants) - A (unadjusted) = B (adjusted)
      10.86       -      3.86      =     7.00

Total (constants) - B (unadjusted) = A (adjusted)
      10.86       -      4.86      =     6.00


The results obtained by this method agree exactly with those of the
previous section, and the analysis of variance will of course be the same.
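
The arithmetic of this partition can be sketched as follows. The original and adjusted values are again assumed to be those of the worked example, the names are illustrative only, and fractions are used so that the identities hold exactly rather than to two decimals.

```python
from fractions import Fraction as F

# Assumed original values (None = missing variate) and the adjusted values Y.
X = [[4, 6, 8, None], [8, 10, 6, 6]]
Y = [[5, 7, 6, 4], [7, 9, 8, 6]]

# Pair each existing original value with its adjusted value.
pairs = [(x, y) for xr, yr in zip(X, Y) for x, y in zip(xr, yr) if x is not None]
n = len(pairs)
G = sum(x for x, _ in pairs)
corr = F(G * G, n)                                  # correction for the mean

total_ss = sum(x * x for x, _ in pairs) - corr      # total, original values
const_ss = sum(y * y for _, y in pairs) - corr      # fitted constants
error_ss = sum((x - y) ** 2 for x, y in pairs)      # sum of (X - Y)^2

# Unadjusted sums of squares from the original row and column totals.
rows = [[x for x in xr if x is not None] for xr in X]
cols = [[xr[j] for xr in X if xr[j] is not None] for j in range(4)]
a_unadj = sum(F(sum(v) ** 2, len(v)) for v in rows) - corr
b_unadj = sum(F(sum(v) ** 2, len(v)) for v in cols) - corr

# Adjusted sums of squares by the simpler identities.
a_adj = const_ss - b_unadj
b_adj = const_ss - a_unadj

print(round(float(total_ss), 2), round(float(const_ss), 2), error_ss)
print(a_adj, b_adj)
```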

For general application the formulas can be simplified considerably
and the fitting of constants need not actually be carried out. The whole
procedure can be developed from a knowledge of the estimate for the
missing value. For a 2 x c table we will have
\[ X_{ij} = \frac{2R_i + cC_j - G}{c - 1} \tag{12} \]
where $R_i$ is the total of the row containing the missing value and $C_j$ is
the total of the column containing the missing value. This is given
directly in our example by
\[ X_{03} = \frac{2R_i + 4C_j - G}{3} = \frac{36 + 24 - 48}{3} = 4.0 \]

We then have two sets of data referred to as the original values and the
completed values.


          Original Values                        Completed Values
                 B                                      B
          0    1    2    3   Total            0    1    2    3   Total   Mean
 A   0    4    6    8    -     18       A  0  4    6    8    4     22     5.5
     1    8   10    6    6     30          1  8   10    6    6     30     7.5
 Total   12   16   14    6     48       Total 12  16   14   10     52


An analysis of variance of the completed values can be set up as follows,
and this serves to obtain the error sum of squares.


           SS      DF
 A         8.0      1
 B        10.0      3
 Error    12.00     2
 Total    30.00     6


In order to make tests of significance, adjustments must be made to the
sums of squares for A and B. General formulas for these adjustments
have been given by Yates [6, 9] and Anderson [2], and these will be
referred to in Section 6. For the 2 x 4 table of the example these are
\[ \text{Adjustment to } A = \tfrac{1}{2}(X_{ij} - X_a)^2, \qquad \text{Adjustment to } B = \tfrac{3}{4}(X_{ij} - X_b)^2 \tag{13} \]
where $X_{ij}$ is the estimate for the missing value, $X_a = C_3$, $X_b = R_0/3$, and
the coefficients 1/2 and 3/4 arise from the dimensions of the table. For a
2 x c table the coefficient of $(X_{ij} - X_a)^2$ is always 1/2, and the coefficient
of $(X_{ij} - X_b)^2$ is $(c - 1)/c$. The divisor of $R_i$ in calculating $X_b$ is $c - 1$.
In general, therefore, for a 2 x c table
\[ X_a = C_j \tag{14} \]
\[ X_b = \frac{R_i}{c - 1} \tag{15} \]
Then
\[ \text{Adjustment to } A \text{ (rows of table)} = \tfrac{1}{2}(X_{ij} - X_a)^2 \tag{16} \]
and
\[ \text{Adjustment to } B \text{ (columns of table)} = \frac{c - 1}{c}(X_{ij} - X_b)^2 \tag{17} \]

Applying the adjustments to our example, we have
\[ \text{Adjustment to } A = \tfrac{1}{2}(4 - 6)^2 = 2 \]
\[ \text{Adjustment to } B = \tfrac{3}{4}(4 - 6)^2 = 3 \]
Then the adjusted sums of squares are

 A =  8.0 - 2.0 = 6.0
 B = 10.0 - 3.0 = 7.0

and we have the same results as with previous methods.

6. The 2 x c table with more than one missing value. The methods of
analysis for this situation are described in the following section on r x c
tables and can be applied without difficulty to the 2 x c table.
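
For the single missing value of Section 5, the shortcut of equations 12 and 14 to 17 can be sketched as one small function. The function name and argument order are illustrative assumptions, not part of the text; exact fractions are used to avoid rounding.

```python
from fractions import Fraction as F

def two_by_c_missing(R_i, C_j, G, c):
    """Estimate and adjustments for one missing value in a 2 x c table.

    R_i, C_j: totals of the row and column containing the missing value.
    G: total of all existing values.  c: number of columns.
    Returns (estimate, adjustment to A, adjustment to B).
    """
    x_hat = F(2 * R_i + c * C_j - G, c - 1)      # equation (12)
    x_a = F(C_j)                                 # equation (14)
    x_b = F(R_i, c - 1)                          # equation (15)
    adj_a = F(1, 2) * (x_hat - x_a) ** 2         # equation (16)
    adj_b = F(c - 1, c) * (x_hat - x_b) ** 2     # equation (17)
    return x_hat, adj_a, adj_b

# Worked example of Section 5: R_0 = 18, C_3 = 6, G = 48, c = 4.
estimate, adj_a, adj_b = two_by_c_missing(18, 6, 48, 4)
print(estimate, adj_a, adj_b)
print("A adjusted:", 8 - adj_a, " B adjusted:", 10 - adj_b)
```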

7. Missing values in an r x c table. It should be apparent from a
study of the previous sections in this chapter that the method of fitting 
constants is a general method that can be applied to any example provided 
that one can find a solution to the simultaneous equations. Fortunately, 
as pointed out by Yates [7], the problem of missing values can be solved 
merely by obtaining estimates for these values by methods that are based 
on the fitting of constants. The estimation of the missing values may be 
approached from three different standpoints although they all lead to the 
same results. These approaches are briefly: 

1. The substitution of a value that will make the error sum of squares
a minimum. Allan and Wishart [1] and Yates [6, 7, 9] have developed
this method, Yates having generalized the method and shown that it
agrees with fitting constants.

2. The fitting of constants, Yates [7]. The procedure gives an estimate 
of the missing value and furnishes a basic method for making tests of 
significance. 

3. The covariance technique. This was first suggested and applied by 
Bartlett [3]. Anderson [2] gives a generalization of the method, showing 
in detail how it can be applied to split-plot experiments. The principle 
is based on substituting another variable X (the original values are Y),
and making X = 0 for all pairs where Y is present and -1 for the pair
in which Y is missing.* The regression coefficient in the error line of the
covariance analysis is the estimate of the missing value. This method 
also furnishes an easy derivation of the formula for the bias in a sum of 
squares to which a test of significance is to be applied. 

The methods that are most easily applied are demonstrated in the
following examples.

* Note that in previous sections the symbol X represented original values. The
change to Y is made here because in the covariance method of estimating missing values
the original values are presumed to be adjusted by a new variable X, and the symbolism
adopted corresponds therefore with the symbolism of regression analysis.

8. Example 14-1. One missing value in an r x c table. Table 14-1
gives the yields of 6 varieties in a 4-replicate experiment for which 1 value
is missing. This is referred to as the table of original values.

The estimate of the missing value is given by
\[ Y_{ij} = \frac{cC_j + rR_i - G}{(c - 1)(r - 1)} \tag{18} \]
where c represents the number of treatments and r the number of
replicates. Here
\[ \frac{6 \times 50.9 + 4 \times 68.4 - 365.3}{5 \times 3} = 14.25 \]


                              TABLE 14-1

                          Treatment                       Block
 Block      1      2      3      4      5      6          Total
   1      18.5   15.7   16.2   14.1   13.0   13.6          91.1
   2      11.7     -    12.9   14.4   16.9   12.5          68.4 = R_i
   3      15.4   16.6   15.5   20.3   18.4   21.6         107.8
   4      16.5   18.6   12.7   15.7   16.5   18.0          98.0
 Treatment
 Total    62.1   50.9   57.3   64.5   64.8   65.7         365.3 = G


We can now form a table of completed values, but only those figures need
be entered where changes are made. This gives us Table 14-2.


                              TABLE 14-2

                          Treatment                       Block
 Block      1      2      3      4      5      6          Total
   1
   2             14.25                                     82.65
   3
   4
 Treatment
 Total           65.15                                    379.55


We now set up an analysis of variance for the completed values. At the
same time it is convenient to record the sum of squares and correction
term for the original values.


Total SS (original values)  = 5948.89 - 5801.92 = 146.97
Total SS (completed values) = 6151.95 - 6002.42 = 149.53


The analysis of variance is given below.


                  SS       DF
 Blocks         56.76       3
 Treatments     12.58       5
 Error          80.19      14

 Total         149.53      22


Accurate tests of significance cannot be made at this point because of the
bias in the sums of squares for blocks and treatments. A general formula
is available for the calculation of the bias, which for the treatment sum
of squares is
\[ \frac{(R_i + cC_j - G)^2}{c(c - 1)(r - 1)^2} = \frac{(68.4 + 6 \times 50.9 - 365.3)^2}{270} = 0.27 \tag{19} \]
but we shall also give the general method applicable to the case where
more than 1 value is missing. The procedure is outlined below.


                                 DF       SS       MS
   Total (original values)       22     146.97
 - Error (completed values)      14      80.19    5.728
 = Blocks + Treatments            8      66.78
 - Blocks (unadjusted)            3      54.47
 = Treatments (adjusted)          5      12.31    2.462


The unadjusted sum of squares for blocks is of course calculated from
the original block totals.
\[ \frac{91.1^2 + 107.8^2 + 98.0^2}{6} + \frac{68.4^2}{5} - \frac{365.3^2}{23} = 54.47 \]
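
The figures of Example 14-1 can be reproduced with the short sketch below. The function names are hypothetical; the totals are those of Table 14-1, and the sums of squares 146.97, 80.19, and 12.58 are copied from the analyses above rather than recomputed from the individual plots.

```python
def missing_plot(R, C, G, r, c):
    """Equation 18: estimate of a single missing value in an r x c table."""
    return (c * C + r * R - G) / ((c - 1) * (r - 1))

def treatment_bias(R, C, G, r, c):
    """Equation 19: bias in the treatment sum of squares of the completed values."""
    return (R + c * C - G) ** 2 / (c * (c - 1) * (r - 1) ** 2)

R_i, C_j, G = 68.4, 50.9, 365.3   # block, treatment, and grand totals (Table 14-1)
r, c = 4, 6                       # replicates and treatments

estimate = missing_plot(R_i, C_j, G, r, c)
bias = treatment_bias(R_i, C_j, G, r, c)

# Unadjusted block sum of squares from the original block totals;
# block 2 has only 5 values, and 23 values exist in all.
blocks_unadj = (91.1**2 + 107.8**2 + 98.0**2) / 6 + 68.4**2 / 5 - G**2 / 23

# Treatments (adjusted): by the subtraction scheme, and by removing the bias.
by_subtraction = 146.97 - 80.19 - blocks_unadj
by_bias = 12.58 - bias

print(round(estimate, 2), round(bias, 2), round(blocks_unadj, 2))
print(round(by_subtraction, 2), round(by_bias, 2))
```

Both routes give the same adjusted treatment sum of squares to two decimals, which is the point of the comparison.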


The only remaining problem is to make the correct adjustment to the
standard error of a difference between 2 means where 1 of the means
contains a missing value. For comparing 2 treatments, one of which
contains a missing value, we have the formula given by Yates for the
standard error of a mean difference,
\[ s_{\bar{d}} = \sqrt{\frac{s^2}{r}\left[2 + \frac{c}{(c - 1)(r - 1)}\right]} \tag{20} \]
giving in the example
\[ s_{\bar{d}} = \sqrt{\frac{5.728}{4}\left(2 + \frac{6}{5 \times 3}\right)} = 1.85 \]
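
A minimal sketch of this standard error, set beside the usual formula for two complete treatments, is given below. It assumes that equation 20 has the form $\sqrt{(s^2/r)[2 + c/((c-1)(r-1))]}$; the function names are illustrative.

```python
import math

def se_diff_complete(s2, r):
    """Standard error of a difference between 2 treatment means, no missing values."""
    return math.sqrt(2 * s2 / r)

def se_diff_one_missing(s2, r, c):
    """Standard error when one of the 2 treatments contains a missing value."""
    return math.sqrt(s2 / r * (2 + c / ((c - 1) * (r - 1))))

s2 = 5.728        # error mean square of Example 14-1
r, c = 4, 6       # replicates and treatments

print(round(se_diff_complete(s2, r), 2))
print(round(se_diff_one_missing(s2, r, c), 2))
```

The second value is the larger, reflecting the information lost with the missing plot.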


9. Example 14-2. More than 1 missing value in an r x c table. The
data of Example 14-1 are repeated in Table 14-3 with 3 values missing.


                              TABLE 14-3

                          Treatment                       Block
 Block      1      2      3      4      5      6          Total
   1      18.5   15.7   16.2   14.1   13.0   13.6          91.1
   2      11.7   Y_22   12.9   Y_24   16.9   12.5          54.0
   3      15.4   16.6   15.5   20.3   18.4   21.6         107.8
   4      Y_41   18.6   12.7   15.7   16.5   18.0          81.5
 Treatment
 Total    45.6   50.9   57.3   50.1   64.8   65.7         334.4

 Mean = 15.9


The estimation of the missing values is accomplished by repeated
applications of the formula for estimating single values. We start by
estimating $Y_{41}$, using an approximate value for $Y_{22}$ and $Y_{24}$. A convenient
approximate value to begin with is the mean of the experiment.
The method is illustrated below.


First approximation

\[ R_4 = 81.5 \qquad C_1 = 45.6 \qquad G = 334.4 + 2 \times 15.9 = 366.2 \]
\[ Y_{41} = \frac{6 \times 45.6 + 4 \times 81.5 - 366.2}{15} = 15.6 \]

\[ R_2 = 54.0 + 15.9 = 69.9 \qquad C_2 = 50.9 \qquad G = 334.4 + 15.6 + 15.9 = 365.9 \]
\[ Y_{22} = \frac{6 \times 50.9 + 4 \times 69.9 - 365.9}{15} = 14.6 \]

\[ R_2 = 54.0 + 14.6 = 68.6 \qquad C_4 = 50.1 \qquad G = 334.4 + 15.6 + 14.6 = 364.6 \]
\[ Y_{24} = \frac{6 \times 50.1 + 4 \times 68.6 - 364.6}{15} = 14.0 \]

Second approximation

\[ R_4 = 81.5 \qquad C_1 = 45.6 \qquad G = 334.4 + 14.6 + 14.0 = 363.0 \]
\[ Y_{41} = \frac{6 \times 45.6 + 4 \times 81.5 - 363.0}{15} = 15.8 \]

\[ R_2 = 54.0 + 14.0 = 68.0 \qquad C_2 = 50.9 \qquad G = 334.4 + 15.8 + 14.0 = 364.2 \]
\[ Y_{22} = \frac{6 \times 50.9 + 4 \times 68.0 - 364.2}{15} = 14.2 \]

\[ R_2 = 54.0 + 14.2 = 68.2 \qquad C_4 = 50.1 \qquad G = 334.4 + 15.8 + 14.2 = 364.4 \]
\[ Y_{24} = \frac{6 \times 50.1 + 4 \times 68.2 - 364.4}{15} = 13.9 \]

Third approximation

\[ R_4 = 81.5 \qquad C_1 = 45.6 \qquad G = 334.4 + 14.2 + 13.9 = 362.5 \]
\[ Y_{41} = 15.8 \]

\[ R_2 = 54.0 + 13.9 = 67.9 \qquad C_2 = 50.9 \qquad G = 334.4 + 15.8 + 13.9 = 364.1 \]
\[ Y_{22} = 14.2 \]

\[ R_2 = 54.0 + 14.2 = 68.2 \qquad C_4 = 50.1 \qquad G = 334.4 + 15.8 + 14.2 = 364.4 \]
\[ Y_{24} = 13.9 \]

Since the second and third approximations give the same values, it is
obvious that further calculations are unnecessary.
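
The successive approximations lend themselves to a short loop. The sketch below is one way of organizing the calculation, not a prescribed routine: the function name, the order of estimation (the order used above), and the rounding of each estimate to one decimal are assumptions made to mirror the hand computation. The value 16.9 for treatment 5 in block 2 is the figure implied by the block and treatment totals.

```python
def iterate_missing(data, order, start, max_rounds=20):
    """Estimate several missing plots by repeated use of the single-value formula.

    data:  blocks x treatments table with None for missing plots.
    order: list of (block, treatment) indices, in the order they are estimated.
    start: first approximation used for every missing plot.
    Returns the final estimates and the list of successive approximations.
    """
    r, c = len(data), len(data[0])
    filled = [row[:] for row in data]
    for i, j in order:
        filled[i][j] = start
    history = []
    current = [start] * len(order)
    for _ in range(max_rounds):
        for i, j in order:
            # Totals of the block, treatment, and table, excluding this plot
            # but including the current approximations for the other plots.
            R = sum(filled[i]) - filled[i][j]
            C = sum(row[j] for row in filled) - filled[i][j]
            G = sum(map(sum, filled)) - filled[i][j]
            filled[i][j] = round((c * C + r * R - G) / ((c - 1) * (r - 1)), 1)
        current = [filled[i][j] for i, j in order]
        if history and history[-1] == current:
            break                      # two successive rounds agree
        history.append(current)
    return current, history

table_14_3 = [
    [18.5, 15.7, 16.2, 14.1, 13.0, 13.6],
    [11.7, None, 12.9, None, 16.9, 12.5],
    [15.4, 16.6, 15.5, 20.3, 18.4, 21.6],
    [None, 18.6, 12.7, 15.7, 16.5, 18.0],
]
order = [(3, 0), (1, 1), (1, 3)]       # Y41, Y22, Y24 (zero-based indices)

final, history = iterate_missing(table_14_3, order, start=15.9)
print(history)
print(final)
```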

A table of completed values is then set up and an analysis of variance
made, keeping a record of the sum of squares of the original values for
future work. We have


Total SS (original values)  = 5469.28 - 5324.92 = 144.36
Total SS (completed values) = 6113.77 - 5962.95 = 150.82


                           COMPLETED VALUES

                          Treatment                       Block
 Block      1      2      3      4      5      6          Total
   1
   2             14.2          13.9                        82.1
   3
   4      15.8                                             97.3
 Treatment
 Total    61.4   65.1          64.0                       378.3


The analysis is

                  SS       DF
 Blocks         58.34       3
 Treatments     12.75       5
 Error          79.73      12

 Total         150.82      20


The next step is to obtain the adjusted sum of squares for treatments
as given below.


                                  SS       DF      MS
   Total (original values)      144.36     20
 - Error (completed values)      79.73     12     6.644
 = Blocks + Treatments           64.63      8
 - Blocks (unadjusted)           52.54      3
 = Treatments (adjusted)         12.09      5     2.418


The unadjusted block sum of squares is obtained from the original values
by calculating
\[ \frac{91.1^2}{6} + \frac{54.0^2}{4} + \frac{107.8^2}{6} + \frac{81.5^2}{5} - \frac{334.4^2}{21} = 52.54 \]

To obtain the standard error of a mean difference between 2 treatments
we must note how the missing values, if any, occur. The various
possibilities are:

1. No missing values.
\[ s_{\bar{d}} = \sqrt{\frac{2s^2}{r}} \tag{21} \]

2. One treatment contains a missing value. We apply the formula
given by Yates.
\[ s_{\bar{d}} = \sqrt{\frac{s^2}{r}\left[2 + \frac{c}{(c - 1)(r - 1)}\right]} \tag{22} \]

3. Both treatments contain missing values. Count the effective
replication for a given treatment as 1 in each replication where the other
treatment is present, 1/2 when the other treatment is missing, and 0 when
the value is missing. Thus, for comparing treatment 1 with 2, we have


                        Replicate
 Treatment      1      2      3      4      Total
     1          1     1/2     1      0      2 1/2
     2          1      0      1     1/2     2 1/2


Then
\[ s_{\bar{d}} = \sqrt{s^2\left(\frac{1}{2.5} + \frac{1}{2.5}\right)} \tag{23} \]


10. Application of the covariance technique to the estimation of missing 
values. As pointed out briefly in Section 7 it is possible to obtain an 
estimate of a missing value by a procedure that arises out of the methods 
of the covariance analysis. This method is described in detail by 
Anderson [2] with special reference to its application to split-plot experi- 
ments. The general principles of the method can be demonstrated for a 
simple experiment, and once it is clearly understood it can be applied 
quite readily to more complex layouts. 

Assume a randomized block experiment with 2 replicates and 3 treatments
from which the result for 1 plot is missing. We can set up the
data as follows, where the yields are represented by Y and the concomitant
variable by X which takes the value of -1 for the missing plot and 0
for all others. The treatments are $A_1$, $A_2$, and $A_3$, and the plot for $A_1$
in replicate 1 is missing. R and T denote the totals of the replicate and
the treatment containing the missing plot, and G the total of all existing
values.


                       A_1     A_2     A_3     Total
 Replicate 1    X      -1       0       0       -1
                Y       0      Y_12    Y_13      R
 Replicate 2    X       0       0       0        0
                Y      Y_21    Y_22    Y_23     R_2
 Total          X      -1       0       0       -1
                Y       T      T_2     T_3       G


In detail the total sum of products is seen to be
\[ (-1) \times 0 + 0 \times Y_{12} + \cdots + 0 \times Y_{23} - \frac{(-1) \times G}{6} = \frac{G}{6} \]
Then for A we have
\[ \frac{(-1) \times T + 0 \times T_2 + 0 \times T_3}{2} - \frac{(-1) \times G}{6} = -\frac{T}{2} + \frac{G}{6} \]
Similarly for replicates we have $-R/3 + G/6$. The error term is
\[ \frac{G}{6} - \left(-\frac{T}{2} + \frac{G}{6}\right) - \left(-\frac{R}{3} + \frac{G}{6}\right) = \frac{T}{2} + \frac{R}{3} - \frac{G}{6} = \frac{3T + 2R - G}{6} \]

To obtain regression coefficients we require the sums of squares of X. It
is easily seen that these are
\[ \text{Total} \quad (-1)^2 - \frac{(-1)^2}{6} = \frac{5}{6} \]
\[ A \quad \frac{(-1)^2}{2} - \frac{1}{6} = \frac{2}{6} \]
\[ R \quad \frac{(-1)^2}{3} - \frac{1}{6} = \frac{1}{6} \]
\[ \text{Error} \quad \frac{5}{6} - \frac{2}{6} - \frac{1}{6} = \frac{2}{6} \]
and the rule is that in any line of the covariance analysis the sum of
squares of X is equal to the degrees of freedom divided by the total
number of observations.

The regression coefficient in the error line is
\[ \frac{3T + 2R - G}{6} \times \frac{6}{2} = \frac{3T + 2R - G}{2} \]
which is the well-known formula (see Section 8) for estimating the value
of a missing plot for a randomized block experiment.
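
The equivalence can be verified numerically. In the sketch below the yields are hypothetical figures chosen only for illustration, the helper `error_line` is an assumed name, and the layout is the 2-replicate, 3-treatment experiment just described with the plot for the first treatment in replicate 1 missing (entered as Y = 0, X = -1).

```python
from fractions import Fraction as F

def error_line(U, V):
    """Error sum of products of U and V for a replicates x treatments layout."""
    r, p = len(U), len(U[0])
    corr = F(sum(map(sum, U)) * sum(map(sum, V)), r * p)
    total = sum(U[i][j] * V[i][j] for i in range(r) for j in range(p)) - corr
    reps = sum(F(sum(U[i]) * sum(V[i]), p) for i in range(r)) - corr
    trts = sum(
        F(sum(u[j] for u in U) * sum(v[j] for v in V), r) for j in range(p)
    ) - corr
    return total - reps - trts

# Hypothetical yields; the missing plot is entered as 0.
Y = [[0, 7, 9],
     [5, 8, 11]]
# Concomitant variable: -1 for the missing plot, 0 elsewhere.
X = [[-1, 0, 0],
     [0, 0, 0]]

# Regression coefficient in the error line of the covariance analysis.
b_error = error_line(X, Y) / error_line(X, X)

# Missing-plot formula with p = 3 treatments and r = 2 replicates.
R, T, G = 7 + 9, 5, 40
formula = F(3 * T + 2 * R - G, (3 - 1) * (2 - 1))

print(error_line(X, X), b_error, formula)
```

The error sum of squares of X also comes out as 2/6, the error degrees of freedom over the number of observations, in line with the rule stated above.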

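A feature of this method is that it provides an easy solution to the bias
in the sum of squares for treatments when this is computed from the
completed values. We obtain a new estimate of the missing value that
we shall represent as $Y_t$, which comes from combining the sums of
products, and the sums of squares of X, for treatments and error. Thus
\[ \sum(XY) \text{ (sum of products)} = -\frac{T}{2} + \frac{G}{6} + \frac{3T + 2R - G}{6} = \frac{R}{3} \]
\[ \sum(X^2) \text{ (sum of squares)} = \frac{2}{6} + \frac{2}{6} = \frac{2}{3} \]
Then
\[ Y_t = \frac{R}{3} \times \frac{3}{2} = \frac{R}{2} \]
and the bias in the sum of squares for treatments is
\[ (Y_t - Y)^2 \sum(X^2) = \frac{2}{3}\left(Y - \frac{R}{2}\right)^2 \]

In general, for a randomized block experiment we have
\[ Y = \frac{rR + pT - G}{(p - 1)(r - 1)} \tag{24} \]
\[ \sum(XY) = \frac{R}{p}, \qquad \sum(X^2) = \frac{p - 1}{p}, \qquad \text{giving} \quad Y_t = \frac{R}{p - 1} \tag{25} \]
\[ \text{Bias} = \frac{p - 1}{p}\left(Y - \frac{R}{p - 1}\right)^2 = \frac{(R + pT - G)^2}{p(p - 1)(r - 1)^2} \tag{26} \]
where R = yield of replicate, T = yield of treatment containing the
missing plot, G = total of all existing values, r = number of replicates,
and p = number of treatments.

Applying the same procedure to a Latin square, we get
\[ Y = \frac{r(R + C + T) - 2G}{(r - 1)(r - 2)} \tag{27} \]
\[ \sum(XY) = \frac{r(R + C) - G}{r^2} \tag{28} \]
\[ \sum(X^2) = \frac{(r - 1)^2}{r^2} \tag{29} \]
giving $Y_t = [r(R + C) - G]/(r - 1)^2$ and
\[ \text{Bias} = (Y_t - Y)^2 \left(\frac{r - 1}{r}\right)^2 = \frac{[R + C + (r - 1)T - G]^2}{(r - 1)^2 (r - 2)^2} \]
where r is now the number of rows, columns, and treatments, and C is the
total of the column containing the missing plot.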


11. Disproportionate sub-class numbers. We have seen that non- 
orthogonality arises when values are missing, and methods were discussed 
that enable us to make appropriate tests of significance and extract the 
maximum amount of information from the data. Actually there is no 
clear line of distinction between data from which values are missing and 
data in which the numbers of determinations on which the various effects 
are based are unequal. Where the numbers are unequal, the examples 
can be divided into three general classes as follows. 

1. Single classification. For example, a series of lots of wheat have
been taken from different areas and a determination of weight per bushel
made on each lot. There are different numbers of lots in the areas so
that the data can be represented as:


 Area       A      B      C      D     ...    Grand Total
 Number    N_a    N_b    N_c    N_d    ...        N
 Total     T_a    T_b    T_c    T_d    ...        G


It is of course a simple matter to make a test of the significance of the
variation among the means for areas. The sum of squares for error is
obtained by pooling the sums of squares for each area arising from
deviations from the area mean. The sum of squares for the means is
\[ \frac{T_a^2}{N_a} + \frac{T_b^2}{N_b} + \frac{T_c^2}{N_c} + \cdots - \frac{G^2}{N} \tag{32} \]

With single classification, therefore, we have no problem.
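
Equation 32 translates directly into code. The sketch below uses hypothetical totals and numbers of lots for three areas; the function name is illustrative, and fractions keep the unequal divisors exact.

```python
from fractions import Fraction as F

def between_groups_ss(totals, counts):
    """Sum of squares for means with unequal numbers per group (equation 32)."""
    G, N = sum(totals), sum(counts)
    return sum(F(t * t, n) for t, n in zip(totals, counts)) - F(G * G, N)

# Hypothetical area totals and numbers of lots.
totals = [120, 183, 61]
counts = [2, 3, 1]

ss = between_groups_ss(totals, counts)

# Cross-check against the definition: sum of n * (area mean - grand mean)^2.
grand_mean = F(sum(totals), sum(counts))
check = sum(n * (F(t, n) - grand_mean) ** 2 for t, n in zip(totals, counts))

print(ss, check)
```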

2. Multiple classification. (a) Proportionate sub-class numbers. For
an illustration of this condition we can again think of samples of wheat
taken from different areas, but in this case there are 2 varieties and it has
not been possible to obtain the same number of samples from each area
and for each variety. The numbers, however, are proportional. The
following is a typical example showing numbers on the left and totals on
the right.


                  Numbers                                  Totals
                   Area                                     Area
              A    B    C    D   Total                A      B      C      D     Total
 Variety 1   12   24   18   14     68    Variety 1   S_1a   S_1b   S_1c   S_1d    T_1
 Variety 2    6   12    9    7     34    Variety 2   S_2a   S_2b   S_2c   S_2d    T_2
 Total       18   36   27   21    102    Total       T_a    T_b    T_c    T_d     G


This example does not create any difficulties. The sum of squares for
error comes from within sub-classes, a single sub-class, for example, being
the 12 determinations of variety 1 in area A. The sum of squares for
areas would be obtained as for a single classification with unequal
numbers. That is
\[ \frac{T_a^2}{18} + \frac{T_b^2}{36} + \frac{T_c^2}{27} + \frac{T_d^2}{21} - \frac{G^2}{102} \]
The variety sum of squares would be
\[ \frac{(68T_2 - 34T_1)^2}{34 \times 68 \times 102} \]
The interaction sum of squares could be determined by subtraction from
the total for sub-classes given by
\[ \frac{S_{1a}^2}{12} + \frac{S_{1b}^2}{24} + \cdots + \frac{S_{2d}^2}{7} - \frac{G^2}{102} \tag{33} \]
or it could be calculated directly from
\[ \frac{(12S_{2a} - 6S_{1a})^2}{6 \times 12 \times 18} + \frac{(24S_{2b} - 12S_{1b})^2}{12 \times 24 \times 36} + \cdots - \frac{(68T_2 - 34T_1)^2}{34 \times 68 \times 102} \]
The form of the analysis would be

                 DF
 Areas            3
 Varieties        1
 Interaction      3
 Error           94

and tests of significance made in the usual way.
(b) Disproportionate sub-class numbers. With a similar example the
numbers might be


                   Area
              A    B    C    D   Total
 Variety 1   12   18    7    8     45
 Variety 2    6   15   13   14     48
 Total       18   33   20   22     93


and it is clear that there is no proportionality in either direction in the
table. It is this lack of proportionality that introduces non-orthogonality,
requiring methods of analysis somewhat different from those we are
accustomed to apply to orthogonal data.

12. Considerations arising out of unequal and disproportionate sub-class 
numbers. Since we are concerned here mainly with understanding the 
principles of the methods of analysis we employ rather than with the 
mere elaboration of appropriate techniques, it is important to examine 
the effect of unequal and disproportionate numbers on our method of 
deduction from experimental data. It may not be clear, for example, 
that it is necessary to make certain assumptions regarding the population 
we are sampling before we can decide how we are going to handle the data. 
A simple example will make it obvious that such assumptions must be 
made. Suppose that we have 2 samples of weights. One consists of a
sample of 200 men with a mean weight of 160 pounds, and the other is a
sample of 120 women with a mean weight of 120 pounds. What is the
best estimate we can make of the mean weight of the population from
which these samples were drawn? This will depend on how the sexes
are distributed. If we know that there are approximately equal numbers
of the 2 sexes, the obvious conclusion is that the samples were not drawn
in such a way that there was equal opportunity for both sexes to be
included, and the best estimate is the unweighted mean, or
\[ \frac{160 + 120}{2} = 140 \]

However, suppose that we have no good reason to think that the 2 sexes
are equally represented in the population (for example, the weights might
be taken from a certain occupational group where there is a predominance
of men), and it is the mean weight of this group for which we wish to
obtain an estimate. The obvious course is to assume that the numbers
in the samples represent the approximate numbers in the population, and
the best estimate is the weighted mean or
\[ \frac{(160 \times 200) + (120 \times 120)}{320} = 145 \]
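
The two estimates differ only in the weights attached to the sample means, as the following sketch makes explicit. The dictionary layout is an illustrative choice.

```python
# Sample means (pounds) and sample sizes from the example.
means = {"men": 160, "women": 120}
sizes = {"men": 200, "women": 120}

# Equal weights: appropriate if the sexes are equally numerous in the population.
unweighted = sum(means.values()) / len(means)

# Weights proportional to the sample numbers: appropriate if the samples
# reflect the make-up of the population.
weighted = sum(means[k] * sizes[k] for k in means) / sum(sizes.values())

print(unweighted, weighted)
```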


This example makes it clear that it may be necessary to make assumptions 
regarding the population before conclusions can be drawn. Actually 
this is always the case, regardless of unequal numbers. Strictly speaking, 
the mean of a sample is a good estimate of the mean of the population 
from which the sample was drawn only if the population is normally 
distributed. If the population is J-shaped, bimodal, decidedly skewed, or 
shows any definite abnormality, some statistic other than the mean may 
give a better estimate of the central tendency of the population. 

With disproportionate sub-class numbers there is always the problem 
of non-orthogonality, and, in addition, assumptions made regarding the 
population sampled have an important bearing on the method to be 
chosen for making tests of significance and obtaining estimates of the 
main effects and interactions. 

13. Effect of the presence or absence of interaction. Regardless of
disproportionate numbers it should be obvious that the sub-class means are
in themselves efficient estimates of the hypothetical sub-class means of
the population sampled. This is so because the sub-classes may be
regarded as a series of individual samples and each sample is capable of
giving an estimate of the mean or the variance of the population. In the
estimates of the variance, division of the sums of squares by the degrees
of freedom in each sample ensures that the estimates are unbiased.
Furthermore the sub-class means are capable of providing an estimate of
the variance of the population. It is only necessary in calculating the
sum of squares to make due allowance for unequal numbers. It is
therefore only in connection with the multiple classification of the data
that difficulties arise, in other words, in obtaining estimates of the effects
represented by the marginal means and the interaction. In order to
illustrate this point it is desirable to set up a hypothetical example.

Suppose that we have a 2 x 2 table in which the numbers are represented
by n, the sub-class means by Ȳ, and the sub-class totals by S. Subscripts
can be added to define the sub-classes and the marginal values. A table
of numbers, means, and totals can be set up as follows


         Numbers                     Means                      Totals
            B                          B                          B
 A    n_11  n_12 | N_1.      A    Ȳ_11  Ȳ_12 | Ȳ_1.      A    S_11  S_12 | S_1.
      n_21  n_22 | N_2.           Ȳ_21  Ȳ_22 | Ȳ_2.           S_21  S_22 | S_2.
      N_.1  N_.2 | N              Ȳ_.1  Ȳ_.2 | Ȳ_..           S_.1  S_.2 | S_..


where the marginal means here are the weighted means of the rows and
columns. Thus $\bar{Y}_{1.} = (n_{11}\bar{Y}_{11} + n_{12}\bar{Y}_{12})/N_{1.}$. If there is no interaction
in the population, the individual variates can be represented by an equation
of the form
\[ Y = a + b + m \]
where a represents the effect of the level of A, b the effect of B, and m
the general mean. There will be an error term for each variate so that
strictly speaking we should have
\[ Y = a + b + m + x \]
where x is the error term but can be omitted here without affecting the
conclusions. Summing the various components over the Y's in each cell
and in the margin for the A totals, we have
\[ n_{11}(a_1 + b_1 + m) + n_{12}(a_1 + b_2 + m) = N_{1.}a_1 + n_{11}b_1 + n_{12}b_2 + N_{1.}m = S_{1.} \]
\[ n_{21}(a_2 + b_1 + m) + n_{22}(a_2 + b_2 + m) = N_{2.}a_2 + n_{21}b_1 + n_{22}b_2 + N_{2.}m = S_{2.} \]
Then
\[
\begin{aligned}
\bar{Y}_{1.} &= a_1 + \frac{n_{11}}{N_{1.}}b_1 + \frac{n_{12}}{N_{1.}}b_2 + m \\
\bar{Y}_{2.} &= a_2 + \frac{n_{21}}{N_{2.}}b_1 + \frac{n_{22}}{N_{2.}}b_2 + m
\end{aligned}
\tag{34}
\]
The response to A is
\[ \bar{Y}_{1.} - \bar{Y}_{2.} = (a_1 - a_2) + \left(\frac{n_{11}}{N_{1.}} - \frac{n_{21}}{N_{2.}}\right)b_1 + \left(\frac{n_{12}}{N_{1.}} - \frac{n_{22}}{N_{2.}}\right)b_2 \tag{35} \]
which very clearly represents the effect of A plus 2 terms containing the
numbers and $b_1$ and $b_2$.
We note first that if the numbers are equal or proportional,
\[ \frac{n_{11}}{N_{1.}} = \frac{n_{21}}{N_{2.}} \qquad \text{and} \qquad \frac{n_{12}}{N_{1.}} = \frac{n_{22}}{N_{2.}} \tag{36} \]
the terms containing $b_1$ and $b_2$ disappear.
To note what happens when the numbers are disproportionate we can 
take a set of numbers as follows. 


nı =2 m=3 №. = 5 
ей O N= 
The terms in (35) containing b, and b, are 
(& — Db, + (f — Db, = 5% (5; — bo) 37 


or in general the term containing $b_1$ and $b_2$ is
\[ \frac{n_{11}n_{22} - n_{12}n_{21}}{N_{1.}N_{2.}}(b_1 - b_2) \]
This shows that the response to B is reflected in the means for A.
Similarly we can show that the response to A is reflected in the means for
B. Thus disproportionality is seen to have a definite effect on the
orthogonality of the A and B responses as measured by the weighted
means.
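
The size of this entanglement is easily computed. The sketch below evaluates the coefficient of $(b_1 - b_2)$ for a proportionate and a disproportionate set of numbers, and checks it against the difference of the weighted A means built from additive cell means. The function names and the particular numbers and effects are illustrative assumptions.

```python
from fractions import Fraction as F

def b_coefficient(n):
    """Coefficient of (b1 - b2) in the difference of weighted A means."""
    (n11, n12), (n21, n22) = n
    return F(n11 * n22 - n12 * n21, (n11 + n12) * (n21 + n22))

def weighted_response(n, a, b, m):
    """Difference of weighted A means when every cell mean is a_i + b_j + m."""
    row_means = []
    for i in range(2):
        total = sum(n[i][j] * (a[i] + b[j] + m) for j in range(2))
        row_means.append(F(total, sum(n[i])))
    return row_means[0] - row_means[1]

proportionate = [[12, 24], [6, 12]]
disproportionate = [[4, 8], [6, 10]]
a, b, m = [-2, 2], [-1, 1], 10

print(b_coefficient(proportionate))       # no B effect carried into the A means
print(b_coefficient(disproportionate))

# Weighted response = (a1 - a2) + coefficient * (b1 - b2).
lhs = weighted_response(disproportionate, a, b, m)
rhs = (a[0] - a[1]) + b_coefficient(disproportionate) * (b[0] - b[1])
print(lhs, rhs)
```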
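It is easy to see also that the unweighted means are not affected. For
example
\[ \text{Unweighted mean } (A_1) = \frac{\bar{Y}_{11} + \bar{Y}_{12}}{2} = \tfrac{1}{2}(2a_1 + b_1 + b_2 + 2m) \]
\[ \text{Unweighted mean } (A_2) = \frac{\bar{Y}_{21} + \bar{Y}_{22}}{2} = \tfrac{1}{2}(2a_2 + b_1 + b_2 + 2m) \]
And the difference is $(a_2 - a_1)$.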

We inquire next into the effect of the presence of an interaction. To do
this it is simpler to demonstrate the effect numerically than it is to work
it out in algebraic form. We shall set up a series of tables with varying
degrees of interaction, the first representing zero interaction, and the
numbers remaining as in the first table. In the first table we take
$a_1 = -2$, $a_2 = 2$, $b_1 = -1$, $b_2 = 1$, and $m = 10$. Then
28 = 4 x (-2) + 4 x (-1) + 4 x 10 and the mean is 28/4 = 7, and so on for
the remaining totals. In the other tables the means are changed so as to
produce interactions.


                                   Unweighted  Difference   Weighted  Difference
 Numbers     Means      Totals        Means    A_2 - A_1     Means    A_2 - A_1

  4   8      7   9      28   72         8                     8.33
                                                    4                    3.92
  6  10     11  13      66  130        12                    12.25

  4   8      5  11      20   88         8                     9.00
                                                    4                    2.75
  6  10     13  11      78  110        12                    11.75

  4   8      3  13      12  104         8                     9.67
                                                    4                    1.58
  6  10     15   9      90   90        12                    11.25

  4   8      1  15       4  120         8                    10.33
                                                    4                    0.42
  6  10     17   7     102   70        12                    10.75


Note that the unweighted means remain the same throughout regardless
of the interaction, whereas the weighted means not only change but, as the
interaction increases, the response to A as measured by the difference
between the weighted means decreases. It is easily verified that if the
numbers are equal the unweighted and weighted means remain the same.
By an algebraic procedure we can show that this is due to the interaction
factor becoming entangled with means for the levels of A.
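
The behavior of the two kinds of means can be reproduced with the sketch below. The sub-class numbers and the four sets of sub-class means are those of the illustration as read from the tables; the function name is an assumption.

```python
def a_differences(n, ybar):
    """Return (unweighted, weighted) differences A2 - A1 of the row means."""
    unweighted = [sum(row) / len(row) for row in ybar]
    weighted = [
        sum(ni * yi for ni, yi in zip(n_row, y_row)) / sum(n_row)
        for n_row, y_row in zip(n, ybar)
    ]
    return unweighted[1] - unweighted[0], weighted[1] - weighted[0]

n = [[4, 8], [6, 10]]                 # disproportionate sub-class numbers
tables = [
    [[7, 9], [11, 13]],               # zero interaction
    [[5, 11], [13, 11]],
    [[3, 13], [15, 9]],
    [[1, 15], [17, 7]],               # largest interaction
]

results = [a_differences(n, ybar) for ybar in tables]
for unw, w in results:
    print(unw, round(w, 2))

# With equal numbers the two differences coincide whatever the interaction.
equal = [a_differences([[5, 5], [5, 5]], ybar) for ybar in tables]
```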

This brief discussion should be sufficient to show that for a given set 
of data with disproportionate numbers it is important to take into account 
the existence or non-existence of an interaction. It serves as a basis for 
a complete understanding of the methods that are applied. The following
statement by Yates [6] summarizes the points to be taken into
consideration.


It is frequently reasonable to suppose that the interactions between 
the main effects are negligible. If this is so, the most efficient estimates 
of the magnitude of the main effects may be made by fitting constants 
to represent these effects only. 

If the interactions cannot be ignored the efficient estimates of the 
main effects are the means of the sub-class means (assuming the 
multiple classification to be complete). In orthogonal experiments 
these estimates are precisely the same as those obtained on fitting 
constants to represent the main effects only. In orthogonal
experiments, therefore, there is no need to consider whether the interactions
are in fact negligible, when estimating the main effects.

14. Methods of analysis appropriate for different conditions. Since
there is more than one method for the analysis of data with
disproportionate numbers, depending on different conditions and the
assumptions made, the student will find it confusing unless the
fundamentals outlined above are kept clearly in mind. In the first place
the method known as the fitting of constants is basic to all other methods
and would be applied almost exclusively except for the fact that other
methods are less laborious, especially when the table contains more than
2 rows and columns. The first step, therefore, in a study of methods of
analysis should be to become familiar with the method of fitting constants.

The second general method also developed by Yates is known as the 
weighted squares of means. It also has a sound theoretical background 
and its main application is in the analysis of data in which it is impossible 
to assume that the interaction is negligible. 

15. Analysis of a 2 x 2 table with disproportionate sub-class numbers. 
In the first place we shall illustrate the application of the method of 
fitting constants. Assume that we have a 2 x 2 table as follows, using 
the notation described above. 


The first step in all methods is to make a preliminary analysis of differences
between the sub-class means. For a 2 x 2 table we have a simple
analysis of the form


                      DF
Sub-class means       3
Error                 N - 4
Total                 N - 1


Obviously, if the sub-class means are not significantly different, there is 
not much object in proceeding further. If the differences are significant, 
the sub-class means can be compared and this may provide all the 
information required. However, it may be desirable to have independent 
estimates and tests of the A and B effects and the interaction AB. 

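The fitting of constants follows the same general procedure as that
outlined for the treatment of data with missing values. Here we have the
equations

$$S_{1.} = N_{1.}a_1 + n_{11}b_1 + n_{12}b_2 + N_{1.}m$$
$$S_{2.} = N_{2.}a_2 + n_{21}b_1 + n_{22}b_2 + N_{2.}m$$
$$S_{.1} = N_{.1}b_1 + n_{11}a_1 + n_{21}a_2 + N_{.1}m \tag{38}$$
$$S_{.2} = N_{.2}b_2 + n_{12}a_1 + n_{22}a_2 + N_{.2}m$$

Before illustrating how these simplify, we shall define

$$R_1 = \frac{n_{11}}{N_{.1}}S_{.1} + \frac{n_{12}}{N_{.2}}S_{.2}, \quad R_2 = \frac{n_{21}}{N_{.1}}S_{.1} + \frac{n_{22}}{N_{.2}}S_{.2}, \quad C_1 = \frac{n_{11}}{N_{1.}}S_{1.} + \frac{n_{21}}{N_{2.}}S_{2.}, \quad C_2 = \frac{n_{12}}{N_{1.}}S_{1.} + \frac{n_{22}}{N_{2.}}S_{2.} \tag{39}$$

$$R_{12} = \frac{n_{11}n_{21}}{N_{.1}} + \frac{n_{12}n_{22}}{N_{.2}}, \quad C_{12} = \frac{n_{11}n_{12}}{N_{1.}} + \frac{n_{21}n_{22}}{N_{2.}} \tag{40}$$

Then the simultaneous equations simplify to
$$S_{1.} = R_1 + a_1R_{12} - a_2R_{12}, \quad S_{.1} = C_1 + b_1C_{12} - b_2C_{12} \tag{41}$$
$$S_{2.} = R_2 - a_1R_{12} + a_2R_{12}, \quad S_{.2} = C_2 - b_1C_{12} + b_2C_{12} \tag{42}$$
and since according to a general principle in fitting constants
$$a_1 + a_2 = 0 \quad \text{and} \quad b_1 + b_2 = 0 \tag{43}$$
further simplification gives
$$a_1 = \frac{S_{1.} - R_1}{2R_{12}}, \quad b_1 = \frac{S_{.1} - C_1}{2C_{12}} \tag{44}$$
We can then solve for m, obtaining
$$m + a_1 = \frac{S_{1.} - b_1(n_{11} - n_{12})}{N_{1.}} \tag{45}$$

In calculating $R_1$, $R_2$, $C_1$, and $C_2$, it is convenient to note that, if we
write $n_{i1}/N_{i.}$, $n_{i2}/N_{i.}$, $n_{1j}/N_{.j}$, and $n_{2j}/N_{.j}$ in the margins of the table, the
values required can be obtained by multiplying and summating on the
machine.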

The sum of squares for constants including the correction for the mean
is given by
$$mS + (S_{1.} - S_{2.})a_1 + (S_{.1} - S_{.2})b_1 \tag{46}$$
The adjusted sums of squares for A, B, and the interaction are then
obtained from


Between sub-classes - Constants = Interaction
Constants - A (unadjusted) = B (adjusted)
Constants - B (unadjusted) = A (adjusted)


The unadjusted sums of squares are of course given by
$$A = \frac{(N_{2.}S_{1.} - N_{1.}S_{2.})^2}{N_{1.}N_{2.}N} \tag{47}$$
$$B = \frac{(N_{.2}S_{.1} - N_{.1}S_{.2})^2}{N_{.1}N_{.2}N} \tag{48}$$


The procedure given above can actually be abbreviated slightly, but it 
is reproduced in full here because it follows the general procedure of 
fitting constants to larger tables. 
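
The 2 x 2 procedure can be sketched in a few lines of Python. This is an illustration only: the sub-class numbers and totals are hypothetical (they are not from the text), and the function names are assumptions. The sketch assumes the usual normal equations for the model m + a_i + b_j with a_1 + a_2 = 0 and b_1 + b_2 = 0, solves for the constants through the quantities R_1, R_12, C_1, and C_12, and then checks that all four normal equations are satisfied.

```python
def fit_constants_2x2(n, S):
    """Fit m, a1, b1 (a2 = -a1, b2 = -b1) to a 2 x 2 table.

    n[i][j] is the sub-class number, S[i][j] the sub-class total.
    """
    N_row = [sum(n[0]), sum(n[1])]                    # N_1., N_2.
    N_col = [n[0][0] + n[1][0], n[0][1] + n[1][1]]    # N_.1, N_.2
    S_row = [sum(S[0]), sum(S[1])]                    # S_1., S_2.
    S_col = [S[0][0] + S[1][0], S[0][1] + S[1][1]]    # S_.1, S_.2

    R1 = n[0][0] / N_col[0] * S_col[0] + n[0][1] / N_col[1] * S_col[1]
    R12 = n[0][0] * n[1][0] / N_col[0] + n[0][1] * n[1][1] / N_col[1]
    C1 = n[0][0] / N_row[0] * S_row[0] + n[1][0] / N_row[1] * S_row[1]
    C12 = n[0][0] * n[0][1] / N_row[0] + n[1][0] * n[1][1] / N_row[1]

    a1 = (S_row[0] - R1) / (2 * R12)
    b1 = (S_col[0] - C1) / (2 * C12)
    m = (S_row[0] - b1 * (n[0][0] - n[0][1])) / N_row[0] - a1
    return m, a1, b1


def normal_equation_residuals(n, S, m, a1, b1):
    """Observed marginal totals minus the totals implied by the constants."""
    a, b = [a1, -a1], [b1, -b1]
    fitted = [[n[i][j] * (m + a[i] + b[j]) for j in range(2)] for i in range(2)]
    rows = [sum(S[i]) - sum(fitted[i]) for i in range(2)]
    cols = [S[0][j] + S[1][j] - fitted[0][j] - fitted[1][j] for j in range(2)]
    return rows + cols


# Hypothetical 2 x 2 table with disproportionate sub-class numbers.
n = [[3, 5], [4, 2]]
S = [[30.0, 60.0], [44.0, 26.0]]
m, a1, b1 = fit_constants_2x2(n, S)
residuals = normal_equation_residuals(n, S, m, a1, b1)
```

For this made-up table, 2*a1 and 2*b1 would be taken as the A and B responses.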

On testing the interaction and finding it not significant, the constants for
A and B can be taken as efficient estimates of the A and B responses.
Since $a_1 + a_2 = 0$, we can take $2a_1$ to represent the A response and $2b_1$
to represent the B response. Possibly we should not take the evidence
of the significance of the interaction too literally. Even if the interaction
is significant, if we have good reason to believe that the interaction in the
population is negligible, the constants are still efficient estimates of the
main effects. Also, the interaction may not actually prove significant,
but if we have reason to think that a real interaction effect exists in the
population the A and B constants will not be efficient estimates. Generally
speaking, however, it is sound practice to let the test of significance of the 
interaction be at least a guide to methods of estimating the main effects. 

Assuming an interaction is present, more efficient tests of the main 
effects can be obtained by the method of weighted squares of means 
which is described below. 

16. Analysis of 2 x 2 table using weighted squares of means. The table
is most conveniently set up as follows, wherein the main calculations
required are indicated.


                          B
              n_11  Ȳ_11      n_12  Ȳ_12      Σ(1/n)    w_1.    Ȳ_1.    w_1.Ȳ_1.
              1/n_11          1/n_12
   A
              n_21  Ȳ_21      n_22  Ȳ_22      Σ(1/n)    w_2.    Ȳ_2.    w_2.Ȳ_2.
              1/n_21          1/n_22

   Σ(1/n)                                               Σ(w_i.)         Σ(w_i.Ȳ_i.)
   w_.j       w_.1            w_.2            Σ(w_.j)
   Ȳ_.j       Ȳ_.1            Ȳ_.2
   w_.jȲ_.j   w_.1Ȳ_.1        w_.2Ȳ_.2        Σ(w_.jȲ_.j)


$\bar{Y}_{1.}$ is the unweighted mean of $\bar{Y}_{11}$ and $\bar{Y}_{12}$, etc., $\bar{Y}_{.1}$ is the unweighted
mean of $\bar{Y}_{11}$ and $\bar{Y}_{21}$, etc., $w_{i.}$ is the reciprocal of $\Sigma(1/n)$ in the same line,
and similarly for $w_{.j}$.
The sums of squares for the main effects are

$$A = 4\left[\Sigma(w_{i.}\bar{Y}_{i.}^2) - \frac{[\Sigma(w_{i.}\bar{Y}_{i.})]^2}{\Sigma(w_{i.})}\right] \tag{49}$$
$$B = 4\left[\Sigma(w_{.j}\bar{Y}_{.j}^2) - \frac{[\Sigma(w_{.j}\bar{Y}_{.j})]^2}{\Sigma(w_{.j})}\right] \tag{50}$$

The sum of squares for interaction can be obtained as in fitting constants,
but since we are dealing here with the unweighted means it is more
conveniently calculated by setting up a column on the right containing
values of $d_{i.}$, which are differences between the unweighted means in the
same row. The interaction sum of squares is

$$AB = \Sigma(w_{i.}d_{i.}^2) - \frac{[\Sigma(w_{i.}d_{i.})]^2}{\Sigma(w_{i.})} \tag{51}$$

17. Analysis of an r x 2 table with disproportionate sub-class numbers.
In this notation the table will have r rows and 2 columns. It is convenient
generally to set the table up in this way rather than with c columns
and 2 rows.

The first step as with a 2 x 2 table is to make a preliminary analysis in
order to test the significance of the mean square for sub-class means.
Assuming this to be significant, we proceed to the more detailed analysis,
illustrating first the method of fitting constants. The r x 2 table can be
represented as follows, with the same notation as in the previous section.
It is convenient to illustrate with a 3 x 2 table, and it will be found easy
to extend the method for any number of rows.


                  B
          n_11    n_12   |   N_1.    S_1.
   A      n_21    n_22   |   N_2.    S_2.
          n_31    n_32   |   N_3.    S_3.
          N_.1    N_.2   |   N
          S_.1    S_.2   |           S


The simultaneous equations are
$$S_{1.} = N_{1.}m + N_{1.}a_1 + n_{11}b_1 + n_{12}b_2$$
$$S_{2.} = N_{2.}m + N_{2.}a_2 + n_{21}b_1 + n_{22}b_2$$
$$S_{3.} = N_{3.}m + N_{3.}a_3 + n_{31}b_1 + n_{32}b_2 \tag{52}$$
$$S_{.1} = N_{.1}m + N_{.1}b_1 + n_{11}a_1 + n_{21}a_2 + n_{31}a_3$$
$$S_{.2} = N_{.2}m + N_{.2}b_2 + n_{12}a_1 + n_{22}a_2 + n_{32}a_3$$
We then define
$$C_1 = \frac{n_{11}}{N_{1.}}S_{1.} + \frac{n_{21}}{N_{2.}}S_{2.} + \frac{n_{31}}{N_{3.}}S_{3.} \tag{53}$$
and
$$C_{12} = \frac{n_{11}n_{12}}{N_{1.}} + \frac{n_{21}n_{22}}{N_{2.}} + \frac{n_{31}n_{32}}{N_{3.}} \tag{54}$$
It can then be shown from the simultaneous equations that
$$b_1 = \frac{S_{.1} - C_1}{2C_{12}} \tag{55}$$
and, since $b_1 + b_2 = 0$, we have $b_2 = -b_1$, and substituting in the first
three of the simultaneous equations we get
$$a_1 + m = \frac{S_{1.} - b_1(n_{11} - n_{12})}{N_{1.}}$$
$$a_2 + m = \frac{S_{2.} - b_1(n_{21} - n_{22})}{N_{2.}} \tag{56}$$
$$a_3 + m = \frac{S_{3.} - b_1(n_{31} - n_{32})}{N_{3.}}$$
The sum of squares for the fitted constants is
$$\Sigma(m + a_i)S_{i.} + b_1(S_{.1} - S_{.2}) \tag{57}$$
which includes the correction for the mean. We require also the sums
of squares for
$$B\ (\text{unadjusted}) = \frac{(N_{.2}S_{.1} - N_{.1}S_{.2})^2}{N_{.1}N_{.2}N} \tag{58}$$
$$A\ (\text{unadjusted}) = \frac{S_{1.}^2}{N_{1.}} + \frac{S_{2.}^2}{N_{2.}} + \frac{S_{3.}^2}{N_{3.}} - \frac{S^2}{N} \tag{59}$$
The adjusted sums of squares orthogonal to the error are obtained from

Between sub-classes - Constants = Interaction AB
Constants - B (unadjusted) = A (adjusted)
Constants - A (unadjusted) = B (adjusted)

and the complete analysis can then be set up in the form below, where
the degrees of freedom are given for the 3 x 2 and the r x 2 table.


            3 x 2      r x 2
             DF         DF
A            2         r - 1
B            1           1
AB           2         r - 1

The full procedure of the method of fitting constants is given mainly
for the purpose of illustrating principles. Certain abbreviations are
possible in the actual analysis. For example, the first requisite of the
analysis is to make a test of the interaction in order to decide on the
method of estimating the main effects. It is convenient therefore to know
that the sum of squares for B (adjusted) is given by $C_{12}(2b_1)^2$. By adding
this to A (unadjusted) we get the sum of squares for constants directly
and the interaction sum of squares is obtained as above. In short


(Between sub-classes) - $C_{12}(2b_1)^2$ - (A, unadjusted) = (Interaction)


However, should the interaction prove to be not significant, we will probably
require the values of $a_i + m$ as estimates of the A effect.

On the other hand, if the data indicate that the interaction cannot be 
neglected in the estimation of the main effects, we shall probably proceed 
to the analysis by the weighted squares of means. For this purpose the 
table is set up as below. The last 2 columns are not essential if we have 
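calculated the interaction from the fitting of constants. Note the
subscripts i = 1, 2, 3, and j = 1, 2; and that $\bar{Y}_{i.}$ and $\bar{Y}_{.j}$ are unweighted
means. Also $d_{i.} = (\bar{Y}_{i1} - \bar{Y}_{i2})$ and $w_{i.} = 1/\Sigma(1/n)$.


              n_11  Ȳ_11      n_12  Ȳ_12      Σ(1/n)    w_1.    Ȳ_1.    w_1.Ȳ_1.    d_1.    w_1.d_1.
              1/n_11          1/n_12
   A          n_21  Ȳ_21      n_22  Ȳ_22      Σ(1/n)    w_2.    Ȳ_2.    w_2.Ȳ_2.    d_2.    w_2.d_2.
              1/n_21          1/n_22
              n_31  Ȳ_31      n_32  Ȳ_32      Σ(1/n)    w_3.    Ȳ_3.    w_3.Ȳ_3.    d_3.    w_3.d_3.
              1/n_31          1/n_32

   Σ(1/n)                                               Σ(w_i.)         Σ(w_i.Ȳ_i.)         Σ(w_i.d_i.)
   w_.j       w_.1            w_.2            Σ(w_.j)
   Ȳ_.j       Ȳ_.1            Ȳ_.2
   w_.jȲ_.j   w_.1Ȳ_.1        w_.2Ȳ_.2        Σ(w_.jȲ_.j)

From this table we calculate
$$A = 4\left[\Sigma(w_{i.}\bar{Y}_{i.}^2) - \frac{[\Sigma(w_{i.}\bar{Y}_{i.})]^2}{\Sigma(w_{i.})}\right] \tag{60}$$
$$B = 9\left[\Sigma(w_{.j}\bar{Y}_{.j}^2) - \frac{[\Sigma(w_{.j}\bar{Y}_{.j})]^2}{\Sigma(w_{.j})}\right] \tag{61}$$
where the numerical factor is the square of the number of rows or columns
as the case may be (columns for A, rows for B). If the interaction sum of
squares has not previously been calculated, we take

$$AB = \Sigma(w_{i.}d_{i.}^2) - \frac{[\Sigma(w_{i.}d_{i.})]^2}{\Sigma(w_{i.})} \tag{62}$$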


The remainder of the analysis consists merely of setting up the sums of 
squares and degrees of freedom in the usual way and making tests of 
significance by comparison with the mean square for error arising from 
within sub-classes. 

18. Analysis of an r x c table with disproportionate sub-class numbers. 
The preliminary analysis of a table with more than 2 rows or columns 
will take the usual form, the mean square for sub-class means being 
tested against the mean square for error. If significant effects are present, 
the analysis must be by the fitting of constants if an efficient test of the 
interaction effect is to be made. 

A 4 x 3 table can be represented as follows, where i = 1, 2, 3, 4, and
j = 1, 2, 3.

                      B
          n_11    n_12    n_13   |   N_1.    S_1.
          n_21    n_22    n_23   |   N_2.    S_2.
   A      n_31    n_32    n_33   |   N_3.    S_3.
          n_41    n_42    n_43   |   N_4.    S_4.
          N_.1    N_.2    N_.3   |   N
          S_.1    S_.2    S_.3   |           S


The method of writing down the simultaneous equations is similar to 
that for the 2 x 2 and the r x 2 tables and therefore will not be given. 
It is sufficient to present the solution in terms of the constants. In the
first place we define
$$C'_j = \sum_i \frac{n_{ij}}{N_{i.}}S_{i.} - S_{.j} \tag{63}$$
$$C_{jj} = \sum_i \frac{n_{ij}^2}{N_{i.}} - N_{.j} \tag{64}$$
$$C_{jk} = \sum_i \frac{n_{ij}n_{ik}}{N_{i.}}, \quad j \neq k \tag{65}$$
We can then derive the simultaneous equations
$$C'_1 = b_1C_{11} + b_2C_{12} + b_3C_{13}$$
$$C'_2 = b_1C_{12} + b_2C_{22} + b_3C_{23} \tag{66}$$
$$C'_3 = b_1C_{13} + b_2C_{23} + b_3C_{33}$$

The solution of these equations is facilitated by the fact that $b_1 + b_2 + b_3 = 0$.
Taking $b_3 = -b_1 - b_2$ and substituting, we have

$$C'_1 = b_1(C_{11} - C_{13}) + b_2(C_{12} - C_{13})$$
$$C'_2 = b_1(C_{12} - C_{23}) + b_2(C_{22} - C_{23}) \tag{67}$$

These can be solved by elimination, or in the determinant notation we
can write

$$b_1 = \frac{\begin{vmatrix} C'_1 & (C_{12} - C_{13}) \\ C'_2 & (C_{22} - C_{23}) \end{vmatrix}}{\begin{vmatrix} (C_{11} - C_{13}) & (C_{12} - C_{13}) \\ (C_{12} - C_{23}) & (C_{22} - C_{23}) \end{vmatrix}} \tag{68}$$

In other words, in the general solution for an r x c table one of the
equations can be eliminated by this device. After the values of b have
been determined, the values of $a_i$ are calculated from

$$a_1 + m = \frac{S_{1.}}{N_{1.}} - \frac{n_{11}b_1}{N_{1.}} - \frac{n_{12}b_2}{N_{1.}} - \frac{n_{13}b_3}{N_{1.}}$$
$$a_2 + m = \frac{S_{2.}}{N_{2.}} - \frac{n_{21}b_1}{N_{2.}} - \frac{n_{22}b_2}{N_{2.}} - \frac{n_{23}b_3}{N_{2.}}$$
$$a_3 + m = \frac{S_{3.}}{N_{3.}} - \frac{n_{31}b_1}{N_{3.}} - \frac{n_{32}b_2}{N_{3.}} - \frac{n_{33}b_3}{N_{3.}} \tag{69}$$
$$a_4 + m = \frac{S_{4.}}{N_{4.}} - \frac{n_{41}b_1}{N_{4.}} - \frac{n_{42}b_2}{N_{4.}} - \frac{n_{43}b_3}{N_{4.}}$$
and finally
$$\Sigma(a_i + m) = 4m$$

After obtaining the constants, we can proceed to the calculation of the
sums of squares. First, the sum of squares for constants is

$$\Sigma(a_i + m)S_{i.} + \Sigma(b_jS_{.j}) - \frac{S^2}{N} \tag{70}$$
and the interaction sum of squares comes from subtracting this from the
sum of squares for sub-classes. The general arrangement of the analysis
is as follows in 4 groups.


                               DF
Total                          N - 1
  - Between sub-classes        rc - 1
  = Error                      N - rc

Between sub-classes            rc - 1
  - Constants                  r + c - 2
  = Interaction                (r - 1)(c - 1)

Constants                      r + c - 2
  - A (unadjusted)             r - 1
  = B (adjusted)               c - 1

Constants                      r + c - 2
  - B (unadjusted)             c - 1
  = A (adjusted)               r - 1

Should the interaction prove to be significant and we are unable there- 
fore to neglect it, the test of significance of the main effects should be 
carried out by the weighted squares of means. The method of calculation 
is identical with that for the r x 2 table and does not require repetition. 
The actual calculations can be followed in the examples. 

19. Example 14-3. Analysis of an r x 2 table with disproportionate 
sub-class numbers. The data given in Table 14-4 are for the milk yields 
of dairy cows in 2 seasons, fall and winter, and spring and summer. The 
variation in numbers is due to the individual cow's not freshening with 
equal frequency in the two seasons. For example, cow 2 freshened 7 
times in the fall and winter season and only once in the spring and summer. 

From Table 14-4 we obtain the sums of squares for the preliminary 
analysis. This is 


                  SS        DF      MS        F      5% Point
Sub-classes     409,129     15    27,275    2.71      1.93
Error           402,860     40    10,072
Total           811,989     55


There is evidence that significant differences between sub-class means 
exist, so we proceed to a study first of the interaction effect, and then of 
the main effects. It is convenient to make a new table of numbers and 
totals as in Table 14-5, allowing a column for entering values of $n_{i1}/N_{i.}$
and another for $a_i + m$ to be entered at a later stage in the calculations.

TABLE 14-4


DATA FOR MILK YIELDS, IN POUNDS, OF COWS FRESHENING IN 2 SEASONS


Cow    Fall and Winter    n_i1    S_i1    Spring and Summer    n_i2    S_i2
1 378 591 2 969 336 296 2 632 
2 480 20 465 7 2447 262 1 262 
14 432 344 
342 
3 471 356 2 827 228 312 2 540 
7 404 168 2 572 343 170 192 4 890 
185 
9 141 135 1588 5 880 139 7155 245 3 539 
169 27 
11 $340 = va E “52103 44 38 26 4 1280 
30 343 354 342 
382 
21 2900 EG SI: 3 1184 38 418 458 3 1194 
27 SIS 501., 7402:-- 4 1753 56 313 293 5 1701 
277 434 155 
Total                      32    11,335                         24    7,038

TABLE 14-5


SUMMARIZED DATA FROM TABLE 14-4


Cow     n_i1    S_i1    n_i2    S_i2    N_i.     S_i.    n_i1/N_i.    a_i + m
  1       2      969      2      632      4     1,601    0.500 000    400.250
  2       7    2,447      1      262      8     2,709    0.875 000    312.888
  3       2      827      2      540      4     1,367    0.500 000    341.750
  7       2      572      4      890      6     1,462    0.333 333    255.105
  9       5      880      3      539      8     1,419    0.625 000    168.796
 11       7    2,703      4    1,280     11     3,983    0.636 364    352.732
 21       3    1,184      3    1,194      6     2,378    0.500 000    396.333
 27       4    1,753      5    1,701      9     3,454    0.444 444    387.591

Total    32   11,335     24    7,038     56    18,373

We calculate first


$C_1$ = (0.500 000 x 1601) + (0.875 000 x 2709) + ... +
(0.444 444 x 3454) = 10,487.33


$C_{12}$ = (0.500 000 x 2) + (0.875 000 x 1) + ... + (0.444 444 x 5)
= 12.351 01


Then
$$b_1 = \frac{11{,}335 - 10{,}487.33}{2 \times 12.351\,01} = 34.3158$$

At this stage the interaction sum of squares can be calculated. For this
we require
$$A\ (\text{unadjusted}) = \frac{1601^2}{4} + \frac{2709^2}{8} + \cdots + \frac{3454^2}{9} - \frac{18{,}373^2}{56} = 315{,}516$$
and
$$B\ (\text{adjusted}) = C_{12}(2b_1)^2 = 12.351\,01 \times (68.6316)^2 = 58{,}177$$

Then
Interaction = (Sub-classes) - B (adjusted) - A (unadjusted)
            = 409,129 - 58,177 - 315,516 = 35,436
The mean square for interaction is 35,436/7 = 5062, and since the error
mean square is 10,072 there is no evidence of an interaction and we can
complete the analysis by fitting constants. The sum of squares for
fitting the constants is
B (adjusted) + A (unadjusted)
58,177 + 315,516 = 373,693


We now calculate B (unadjusted).
$$\frac{(11{,}335 \times 24 - 7038 \times 32)^2}{32 \times 24 \times 56} = 50{,}979$$
Then, to obtain A (adjusted),
Constants - B (unadjusted)
373,693 - 50,979 = 322,714


and the complete analysis can be set up.

            SS        DF      MS        F      5% Point
A         322,714      7    46,102    4.58      2.25
B          58,177      1    58,177    5.78      4.08
AB         35,436      7     5,062
Error     402,860     40    10,072

The A and B effects are both significant, so we shall calculate the values
of $a_i + m$.
$$a_1 + m = \frac{1601 - (34.3158 \times 0)}{4} = 400.250$$
$$a_2 + m = \frac{2709 - (34.3158 \times 6)}{8} = 312.888$$
$$a_3 + m = \frac{1367 - (34.3158 \times 0)}{4} = 341.750$$
$$a_4 + m = \frac{1462 - (34.3158 \times (-2))}{6} = 255.105$$

and so forth.
Having calculated the constants, the method given above for calculating
the sum of squares for the fitted constants can be verified. We take

$$\Sigma(a_i + m)S_{i.} + b_1(S_{.1} - S_{.2}) - \frac{S^2}{N} = 6{,}254{,}222 + 147{,}455 - 6{,}027{,}984 = 373{,}693$$

which is in perfect agreement with the figure obtained above.

In the actual calculations it is desirable to follow a scheme as outlined 
below in order to keep the work to a minimum. 

1. Calculate sums of squares for total (T) and sub-classes (U). Then
E = T - U, and make preliminary analysis.

2. Calculate $C_1$ and $C_{12}$, giving $b_1$. Then $B_a$, the adjusted sum of
squares for B, = $C_{12}(2b_1)^2$.

3. Calculate $A_u$, the unadjusted sum of squares for A, and obtain
adjusted sum of squares for interaction $AB_a$ from $AB_a = U - (B_a + A_u)$.

4. Adjusted sum of squares for A, ($A_a$), is obtained from $A_a =
B_a - B_u + A_u$.
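
The scheme above can be followed step by step in code. The sketch below is only an illustration in Python: the function name and the dictionary keys are assumptions, the sub-class numbers and totals are those summarized in Table 14-5, and U is the sub-class sum of squares taken from the preliminary analysis. Small differences from the hand calculation in the last figure are to be expected from rounding.

```python
# (n_i1, S_i1, n_i2, S_i2) for each cow, as summarized in Table 14-5.
cells = [
    (2, 969, 2, 632),
    (7, 2447, 1, 262),
    (2, 827, 2, 540),
    (2, 572, 4, 890),
    (5, 880, 3, 539),
    (7, 2703, 4, 1280),
    (3, 1184, 3, 1194),
    (4, 1753, 5, 1701),
]
U = 409_129  # sub-class sum of squares from the preliminary analysis


def fit_constants_rx2(cells):
    """Steps 2 to 4 of the scheme for an r x 2 table."""
    N1 = sum(n1 for n1, _, _, _ in cells)   # N_.1
    S1 = sum(s1 for _, s1, _, _ in cells)   # S_.1
    N2 = sum(n2 for _, _, n2, _ in cells)   # N_.2
    S2 = sum(s2 for _, _, _, s2 in cells)   # S_.2
    N, S = N1 + N2, S1 + S2

    # Step 2: C_1, C_12, b_1 and the adjusted sum of squares for B.
    C1 = sum(n1 / (n1 + n2) * (s1 + s2) for n1, s1, n2, s2 in cells)
    C12 = sum(n1 * n2 / (n1 + n2) for n1, _, n2, _ in cells)
    b1 = (S1 - C1) / (2 * C12)
    B_adj = C12 * (2 * b1) ** 2

    # Step 3: unadjusted sum of squares for A.
    A_unadj = sum((s1 + s2) ** 2 / (n1 + n2) for n1, s1, n2, s2 in cells) - S**2 / N

    # Step 4: unadjusted B, then adjusted A.
    B_unadj = (S1 * N2 - S2 * N1) ** 2 / (N1 * N2 * N)
    A_adj = B_adj - B_unadj + A_unadj
    return {"C1": C1, "C12": C12, "b1": b1, "B_adj": B_adj,
            "A_unadj": A_unadj, "B_unadj": B_unadj, "A_adj": A_adj}


res = fit_constants_rx2(cells)
interaction = U - (res["B_adj"] + res["A_unadj"])
```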

If the interaction proves to be significant, it is desirable to make
estimates of the main effects from the unweighted means and to test the
significance of these by the method of weighted squares of means. The
necessary calculations are as shown in Table 14-6. Note that the table
as set up enables us to complete the analysis by the weighted squares of
means. The columns headed $d_{i.}$ and $w_{i.}d_{i.}$ are for the calculation of
the interaction and would not be required if we had obtained this by
fitting constants. The two procedures for obtaining the interaction are
algebraically identical.


TABLE 14-6


CALCULATION OF ANALYSIS OF VARIANCE COMPONENTS FOR AN 8 x 2 TABLE BY WEIGHTED SQUARES OF MEANS

Here $d_{i.} = \bar{Y}_{i1} - \bar{Y}_{i2}$ and $\bar{Y}_{i.} = (\bar{Y}_{i1} + \bar{Y}_{i2})/2$.

           n_i1  Ȳ_i1       n_i2  Ȳ_i2       Σ(1/n)      w_i.       d_i.     w_i.d_i.    Ȳ_i.     w_i.Ȳ_i.
             2   484.50       2   316.00
1/n        0.500 000        0.500 000       1.000 000   1.000 00   168.50    168.500   400.250    400.250
             7   349.57       1   262.00
1/n        0.142 857        1.000 000       1.142 857   0.875 00    87.57     76.624   305.785    267.562
             2   413.50       2   270.00
1/n        0.500 000        0.500 000       1.000 000   1.000 00   143.50    143.500   341.750    341.750
             2   286.00       4   222.50
1/n        0.500 000        0.250 000       0.750 000   1.333 33    63.50     84.666   254.250    338.999
             5   176.00       3   179.67
1/n        0.200 000        0.333 333       0.533 333   1.875 00    -3.67     -6.881   177.835    333.441
             7   386.14       4   320.00
1/n        0.142 857        0.250 000       0.392 857   2.545 46    66.14    168.357   353.070    898.726
             3   394.67       3   398.00
1/n        0.333 333        0.333 333       0.666 666   1.500 00    -3.33     -4.995   396.335    594.502
             4   438.25       5   340.20
1/n        0.250 000        0.200 000       0.450 000   2.222 22    98.05    217.889   389.225    864.944

Σ(1/n)     2.569 047        3.366 666                  12.351 01             847.660             4040.174
w_.j       0.389 249   +    0.297 030   =   0.686 279
Ȳ_.j       366.079          288.546
w_.jȲ_.j   142.496     +    85.707      =   228.203

The weighted squares of means gives the following analysis.

$$AB = \Sigma(w_{i.}d_{i.}^2) - \frac{[\Sigma(w_{i.}d_{i.})]^2}{\Sigma(w_{i.})} = 93{,}611.79 - 58{,}175.60 = 35{,}436$$
$$A = 4\left[\Sigma(w_{i.}\bar{Y}_{i.}^2) - \frac{[\Sigma(w_{i.}\bar{Y}_{i.})]^2}{\Sigma(w_{i.})}\right] = 4(1{,}393{,}890.5 - 1{,}321{,}592.8) = 289{,}191$$
$$B = 64\left[\Sigma(w_{.j}\bar{Y}_{.j}^2) - \frac{[\Sigma(w_{.j}\bar{Y}_{.j})]^2}{\Sigma(w_{.j})}\right] = 64(76{,}895.21 - 75{,}882.56) = 64{,}810$$

            SS        DF      MS        F      5% Point
A         289,191      7    41,313    4.10      2.25
B          64,810      1    64,810    6.43      4.08
AB         35,436      7     5,062
Error     402,860     40    10,072


Since the interaction is small, the end result is very much the same as 
by the method of fitting constants. 
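
As an illustration, the weighted squares of means calculation for an r x 2 table might be sketched as follows. The function and variable names are assumptions, and the input is the set of sub-class numbers and totals from Table 14-5. Because the code carries the sub-class means at full precision instead of two decimals, its results should agree with the hand calculation only to within rounding.

```python
# (n_i1, S_i1, n_i2, S_i2) for each cow, as summarized in Table 14-5.
cells = [
    (2, 969, 2, 632),
    (7, 2447, 1, 262),
    (2, 827, 2, 540),
    (2, 572, 4, 890),
    (5, 880, 3, 539),
    (7, 2703, 4, 1280),
    (3, 1184, 3, 1194),
    (4, 1753, 5, 1701),
]


def corrected_weighted_ss(w, y):
    """Sum(w*y^2) - [Sum(w*y)]^2 / Sum(w)."""
    swy = sum(wi * yi for wi, yi in zip(w, y))
    return sum(wi * yi * yi for wi, yi in zip(w, y)) - swy**2 / sum(w)


def weighted_squares_of_means(cells):
    r = len(cells)
    means = [(s1 / n1, s2 / n2) for n1, s1, n2, s2 in cells]   # sub-class means
    w_row = [1 / (1 / n1 + 1 / n2) for n1, _, n2, _ in cells]  # w_i.
    row_mean = [(y1 + y2) / 2 for y1, y2 in means]             # unweighted row means
    d = [y1 - y2 for y1, y2 in means]                          # d_i.

    w_col = [1 / sum(1 / c[0] for c in cells),                 # w_.1
             1 / sum(1 / c[2] for c in cells)]                 # w_.2
    col_mean = [sum(y[0] for y in means) / r,                  # unweighted column means
                sum(y[1] for y in means) / r]

    A = 2**2 * corrected_weighted_ss(w_row, row_mean)   # factor = (columns)^2
    B = r**2 * corrected_weighted_ss(w_col, col_mean)   # factor = (rows)^2
    AB = corrected_weighted_ss(w_row, d)
    return A, B, AB


A, B, AB = weighted_squares_of_means(cells)
```

The interaction value obtained this way can be compared with the one from fitting constants as a check on the arithmetic.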

20. Example 14-4. Analysis of an r x c table with disproportionate
sub-class numbers. Table 14-7 presents in summary form data on the
protein content of barley obtained in a survey made by the Grain Research
Laboratory, Winnipeg. The B classification represents 3 grades, and the
A classification 4 of the chief barley-producing areas of Manitoba.


TABLE 14-7


PROTEIN CONTENT OF BARLEY, IN PERCENTAGE, FOR SAMPLES OF 3 GRADES
AND FROM 4 AREAS IN MANITOBA


               B_1                           B_2                            B_3
             Grade 2                       Grade 3                        Grade 6
A_1   n_11 =  9   S_11 =  96.2      n_12 =  39   S_12 =  437.8      n_13 =  36   S_13 =  392.7
A_2   n_21 = 41   S_21 = 451.5      n_22 = 145   S_22 = 1628.5      n_23 =  96   S_23 = 1102.0
A_3   n_31 = 10   S_31 = 115.8      n_32 =  27   S_32 =  307.1      n_33 = 105   S_33 = 1207.9
A_4   n_41 = 29   S_41 = 328.2      n_42 =  31   S_42 =  359.8      n_43 =  99   S_43 = 1145.3


The preliminary analysis of the data gave the following results.


                  SS         DF      MS         F      5% Point
Sub-classes      29.7164     11    2.7015     4.98      1.81
Error           355.4828    655    0.54272
Total           385.1992    666

On setting up the means for the sub-classes as below, no very marked
interaction is indicated, but at the same time it is difficult to come to the
conclusion that the interaction is negligible.


          B_1       B_2       B_3
A_1      10.69     11.23     10.91
A_2      11.01     11.23     11.48
A_3      11.58     11.37     11.50
A_4      11.32     11.61     11.57


A complete analysis beginning
with the fitting of constants in order to make a test of significance
of the interaction seems to be necessary. For this purpose the data are
set up as in Table 14-8.


TABLE 14-8

                      B
        n_i1    n_i2    n_i3    N_i.     n_i1/N_i.     n_i2/N_i.     n_i3/N_i.      S_i.
          9      39      36      84     0.107 1429    0.464 2857    0.428 5714     926.7
  A      41     145      96     282     0.145 3901    0.514 1844    0.340 4255    3182.0
         10      27     105     142     0.070 4225    0.190 1408    0.739 4366    1630.8
         29      31      99     159     0.182 3899    0.194 9686    0.622 6415    1833.3
N_.j     89     242     336     667 = N
S_.j   991.7  2733.2  3847.9  7572.8 = S

Referring to formulas (63), (64), and (65), we calculate

$C'_1$ = 1011.1410 - 991.7 = 19.4410            $C_{11}$ = 12.9188 - 89 = -76.0812
$C'_2$ = 2733.905870 - 2733.2 = 0.705870        $C_{22}$ = 103.8417 - 242 = -138.1583
$C'_3$ = 3827.7529 - 3847.9 = -20.1471          $C_{33}$ = 187.3918 - 336 = -148.6082
$C_{12}$ = 32.8156       $C_{13}$ = 43.2656       $C_{23}$ = 105.3427

These values are substituted in the simultaneous equations (66).

  19.4410   = b_1(-76.0812)  + b_2(32.8156)    + b_3(43.2656)
   0.705870 = b_1(32.8156)   + b_2(-138.1583)  + b_3(105.3427)
 -20.1471   = b_1(43.2656)   + b_2(105.3427)   + b_3(-148.6082)
  -0.0002       0.0000           0.0000            0.0001        (Sum of columns)


The sums of the coefficients in each column should equal zero except for
errors in rounding decimals.
The next step is the solution of these equations, applying equations (67),
derived from $b_1 + b_2 + b_3 = 0$. This gives


  19.4410   = b_1(-119.3468) + b_2(-10.4500)
   0.705870 = b_1(-72.5271)  + b_2(-243.5010)


A simple solution then comes from the determinant method.*

$$b_1 = \frac{\begin{vmatrix} 19.4410 & -10.4500 \\ 0.705870 & -243.5010 \end{vmatrix}}{\begin{vmatrix} -119.3468 & -10.4500 \\ -72.5271 & -243.5010 \end{vmatrix}} = -\frac{243.5010 \times 19.4410 - 10.4500 \times 0.705870}{119.3468 \times 243.5010 - 72.5271 \times 10.4500} = -0.166\,996$$

Substituting $b_1$ in the first equation, we have
19.4410 = 0.166 996 x 119.3468 - b_2(10.4500)
giving
$$b_2 = \frac{0.4895}{10.4500} = 0.046\,842$$

Finally
$b_3 = -b_1 - b_2$ = 0.166 996 - 0.046 842 = 0.120 154


As a check we can substitute the determined values of 5,, ba, and b, in 
the simultaneous equations. 
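
The solution and the suggested check can also be sketched in Python. The variable names are assumptions; the sub-class numbers and totals are those of Table 14-7; and the quantities C'_j and C_jk are computed in the way the numerical work of this example indicates (the marginal sums of n_ij S_i./N_i. and n_ij n_ik/N_i., less S_.j and N_.j respectively). Substituting b_3 = -b_1 - b_2 reduces the system to two equations, which are solved by determinants, and the residuals of all three original equations serve as the check.

```python
# Sub-class numbers and totals from Table 14-7 (rows A_1..A_4, columns B_1..B_3).
n = [[9, 39, 36], [41, 145, 96], [10, 27, 105], [29, 31, 99]]
S = [[96.2, 437.8, 392.7], [451.5, 1628.5, 1102.0],
     [115.8, 307.1, 1207.9], [328.2, 359.8, 1145.3]]

r, c = 4, 3
N_row = [sum(row) for row in n]
S_row = [sum(row) for row in S]
N_col = [sum(n[i][j] for i in range(r)) for j in range(c)]
S_col = [sum(S[i][j] for i in range(r)) for j in range(c)]

# C'_j and the symmetric array C_jk used in the simultaneous equations.
C_prime = [sum(n[i][j] / N_row[i] * S_row[i] for i in range(r)) - S_col[j]
           for j in range(c)]
C = [[sum(n[i][j] * n[i][k] / N_row[i] for i in range(r))
      - (N_col[j] if j == k else 0)
      for k in range(c)] for j in range(c)]

# Reduced 2 x 2 system after substituting b_3 = -b_1 - b_2.
x1, y1 = C[0][0] - C[0][2], C[0][1] - C[0][2]
x2, y2 = C[1][0] - C[1][2], C[1][1] - C[1][2]
det = x1 * y2 - x2 * y1
b1 = (C_prime[0] * y2 - C_prime[1] * y1) / det
b2 = (x1 * C_prime[1] - x2 * C_prime[0]) / det
b = [b1, b2, -b1 - b2]

# Check: substitute the b values back into the three original equations.
residuals = [C_prime[j] - sum(C[j][k] * b[k] for k in range(c)) for j in range(c)]
```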

Although they may not be essential, we shall find also the values of
the constants $a_1$, $a_2$, $a_3$, $a_4$, and m. The simplest method is to solve
equations (69) for $a_i + m$.

a_1 + m = 11.032 143 + 0.107 142 8 x 0.166 996
          - 0.464 285 7 x 0.046 842 - 0.428 571 4 x 0.120 154
        = 10.9768

Also
a_2 + m = 11.2430
a_3 + m = 11.3985
a_4 + m = 11.4767
The value of $\Sigma(a_i + m)$ = 45.0950, and, since $\Sigma(a_i) = 0$, we have
$$m = \frac{45.0950}{4} = 11.2738$$


* Briefly stated, if
$$c_1 = b_1x_1 + b_2y_1$$
$$c_2 = b_1x_2 + b_2y_2$$
then
$$b_1 = \frac{\begin{vmatrix} c_1 & y_1 \\ c_2 & y_2 \end{vmatrix}}{\begin{vmatrix} x_1 & y_1 \\ x_2 & y_2 \end{vmatrix}} = \frac{c_1y_2 - c_2y_1}{x_1y_2 - x_2y_1}$$


The next step is to obtain the sum of squares for the fitted constants.
$$\Sigma(a_i + m)S_{i.} + \Sigma(b_jS_{.j}) - \frac{S^2}{N} = 85{,}576.3345 + 424.7692 - 85{,}977.9608 = 23.1429$$

We then require

$$A\ (\text{unadjusted}) = \frac{926.7^2}{84} + \frac{3182.0^2}{282} + \frac{1630.8^2}{142} + \frac{1833.3^2}{159} - 85{,}977.9608 = 17.4500$$
$$B\ (\text{unadjusted}) = \frac{991.7^2}{89} + \frac{2733.2^2}{242} + \frac{3847.9^2}{336} - 85{,}977.9608 = 8.0710$$

and we get the adjusted sums of squares as follows.
Sub-classes - Constants = Interaction
29.7164 - 23.1429 = 6.5735
Constants - A (unadjusted) = B (adjusted)
23.1429 - 17.4500 = 5.6929
Constants - B (unadjusted) = A (adjusted)
23.1429 - 8.0710 = 15.0719


The complete analysis can then be set up.


            SS         DF      MS         F      5% Point
A          15.0719      3    5.0240     9.26      2.62
B           5.6929      2    2.8464     5.24      3.01
AB          6.5735      6    1.0956     2.02      2.11
Error     355.4828    655    0.54272

The F for the interaction mean square falls just short of the 5% point,
but the existence of an interaction is certainly not inconceivable in this
example, based not only on the results of the analysis but also on our
knowledge of the behavior of protein contents of barley grades in previous
experience. It is worth while, therefore, to examine the results of estimating
the main effects from the unweighted means and their significance
from the weighted squares of means.
The necessary calculations are given in Table 14-9.

$$A = 9\left[\Sigma(w_{i.}\bar{Y}_{i.}^2) - \frac{[\Sigma(w_{i.}\bar{Y}_{i.})]^2}{\Sigma(w_{i.})}\right] = 9(6377.8804 - 6376.2691) = 9 \times 1.6113 = 14.5017$$
$$B = 16\left[\Sigma(w_{.j}\bar{Y}_{.j}^2) - \frac{[\Sigma(w_{.j}\bar{Y}_{.j})]^2}{\Sigma(w_{.j})}\right] = 16(3961.4736 - 3961.3245) = 2.3856$$

AB (from the method of fitting constants, above) = 6.5735


ANALYSIS: WEIGHTED SQUARES OF MEANS


            SS         DF      MS         F      5% Point
A          14.5017      3    4.8339     8.91      2.62
B           2.3856      2    1.1928     2.20      3.01
AB          6.5735      6    1.0956     2.02      2.11
Error     355.4828    655    0.54272


This analysis leaves some doubt as to the significance of the differences
between grades since the F value is reduced to below the 5% point.


21. Exercise.


1. Table 14-10 gives the weights and numbers of lambs from 3 breeds of sheep over
a period of 7 years. The table is in two sections, the left section showing the numbers
and the right section the corresponding sub-totals.

Use the method of fitting constants to set up an analysis of variance. From your
results decide as to the suitability of the adjusted means for representing the effects
of breed and season. If the adjusted means are not satisfactory, state the action that
should be taken. The total sum of squares for the data is 113,232.0.


b_1 = -6.3464.   b_2 = -3.2066.   b_3 = 9.5530.


Mean squares are:   Breeds         13,905
                    Years           2,408.5
                    Interaction       320.0
                    Error             101.3

TABLE 14-9


         n_i1  Ȳ_i1         n_i2  Ȳ_i2         n_i3  Ȳ_i3         Σ(1/n)         w_i.        Ȳ_i.     w_i.Ȳ_i.
           9   10.6889       39   11.2256       36   10.9083
1/n      0.111 111 1        0.025 641 0        0.027 777 8      0.164 529 9    6.077 923   10.9409    66.4979
          41   11.0122      145   11.2310       96   11.4792
1/n      0.024 390 2        0.006 896 6        0.010 416 7      0.041 703 5   23.978 803   11.2408   269.5409
          10   11.5800       27   11.3741      105   11.5038
1/n      0.100 000 0        0.037 037 0        0.009 523 8      0.146 560 8    6.823 107   11.4860    78.3702
          29   11.3172       31   11.6065       99   11.5687
1/n      0.034 482 8        0.032 258 1        0.010 101 0      0.076 841 9   13.013 733   11.4975   149.6254

Sum                                                                           49.893 566             564.0344
Σ(1/n)   0.269 984 1        0.101 832 7        0.057 819 3      0.429 636 1
w_.j     3.703 922          9.820 028          17.295 263       30.819 213
Ȳ_.j     11.1496            11.3593            11.3650
w_.jȲ_.j 41.2972            111.5486           196.5607         349.4065

TABLE 14-10


WEIGHTS OF SINGLE FEMALE LAMBS FROM 3 BREEDS FOR 7 YEARS*


                         Breed
                   A         B         C
                  S_i1      S_i2      S_i3       S_i.
        1          464     1,361       369      2,194
        2          845     2,758     3,778      7,381
        3          707     2,310     4,281      7,298
Year    4          411     1,595     5,465      7,471
        5          756     1,618     4,003      6,377
        6          489     1,016     3,915      5,420
        7        1,227     1,798     6,813      9,838
        S_.j     4,899    12,456    28,624     45,979

* Data by courtesy of Dr. K. Rasmussen, Experimental Station, Lethbridge, Alta.


CHAPTER 15


Goodness of Fit


1. Two types of data. Generally speaking, the data from experi- 
ments, surveys, etc., may be classified into two types. To these types 
Snedecor [6] has applied the terms measurement and enumeration. We 
obtain measurement data when we measure individual variates for
characteristics such as weight, height, and yield. Enumeration data arise
from recording the number of variates falling into certain classes that are
either descriptive or numerical. The average yields for a set of fertilizer 
treatments would be typical measurement data. The numbers of white 
and brown chaff plants in a segregating population of wheat plants would 
be typical enumeration data. In this chapter and in Chapter 16 we are 
concerned chiefly with methods of estimation and the making of tests of 
significance for data of the enumeration type. 

2. Tests of goodness of fit. In many problems the test of significance 
requires the making of a comparison between a set of actual and a set 
of theoretical frequencies, the latter having been set up in accordance 
with the hypothesis that we wish to test. In an experiment in genetics
an $F_2$ sample may be classified into 2 groups, as in a wheat experiment
in which the $F_2$ sample of 132 plants is classified into 107 that are resistant
to stem rust and 25 susceptible. These numbers suggest a simple 
genetic hypothesis, namely, that rust resistance is dependent on one pair 
of genes, with the gene for resistance dominant in the heterozygous 
condition. In order to test this hypothesis we must determine how 
often, if the hypothesis is true, we can expect to obtain a result that 
diverges from theory as much or more than the one observed. This 
would constitute a test of the goodness of fit of the actual to the 
theoretical ratio. 

3. Goodness of fit for twofold classifications—numbers small. In the 
example to which we have just referred, the plants were classified as 
resistant and susceptible, giving a typical twofold classification. Because 
of certain problems that are characteristic of tests of goodness of fit when 
we have only 2 classes, these are discussed first. In the first place we
should note that the varying conditions that may affect the structure of
the test are:


1. Number in the sample.
2. Symmetry of theoretical distribution.


Let us suppose first that the number (n) in the sample is small, say 20,
which is setting up a purely arbitrary dividing line and does not in any 
sense make a strict division between small and large samples. For 
enumeration data such small samples are rather rare, but suppose that a 
medical doctor has a record of only 16 cases with respect to a given 
disease and finds that 12 of these are men and 4 women.* It is perfectly 
legitimate for him to ask: Is this merely a random variation or does the 
population having this disease contain more men than women? Since 
the theoretical ratio is 1 : 1, or 8: 8 in the sample of 16, this is a problem 
of testing the goodness of fit of the actual frequencies 12:4 with the 
theoretical frequencies of 8:8. At this point the student will recall that
we have already solved problems of this kind by means of the binomial
distribution (Chapter 3, Section 8). We require merely the separate
probabilities of all possible ratios which are given by the terms of the
expanded binomial $(p_1 + p_2)^n$, where $p_1$ is the probability that a given
individual in the sample will be a man, $p_2$ that the individual will be a
woman, and n is the number in the sample. In this example n = 16,
$p_1 = p_2 = \tfrac{1}{2}$, and therefore the binomial to be expanded is $(\tfrac{1}{2} + \tfrac{1}{2})^{16}$.
The calculations are given in Example 15-1.

* The 16 cases must constitute a random selection from the population of patients
suffering from the disease. Therefore any 16 cases available for study do not necessarily
constitute a sample from which reliable results can be obtained. It is essential,
for example, that men and women have an equal chance of being infected or included
in the sample.

4. Example 15-1. Goodness of fit of a 12:4 ratio to the theoretical
8:8. The expansion of the binomial is

$$\left(\tfrac{1}{2} + \tfrac{1}{2}\right)^{16} = \left(\tfrac{1}{2}\right)^{16} + 16\left(\tfrac{1}{2}\right)^{16} + \frac{16!}{14!\,2!}\left(\tfrac{1}{2}\right)^{16} + \frac{16!}{13!\,3!}\left(\tfrac{1}{2}\right)^{16} + \frac{16!}{12!\,4!}\left(\tfrac{1}{2}\right)^{16} + \cdots$$

and the terms are very readily obtained by means of a table of logarithms
of numbers and logarithms of factorials. Note that $(\tfrac{1}{2})^{16}$ is a constant
term. On the machine $(\tfrac{1}{2})^{16}$ is obtained directly by negative multiplication
of log 2 = 0.301 0300 by 16. The machine reads 995.183 520 0
which can be written down directly as $\bar{5}.183\,5200$. We require only the
first 5 terms of the expansion as given below.


                     Probability of a
Number of Men        Particular Ratio          P
     16                 0.000 015          0.000 015
     15                 0.000 244          0.000 259
     14                 0.001 831          0.002 090
     13                 0.008 545          0.010 635
     12                 0.027 771          0.038 406


In the third column P is the sum of the probabilities in the tail of the
distribution. Thus 0.002 090 is the sum of the probabilities of the ratios
14:2, 15:1, and 16:0.

The data obtained enable us to test two different hypotheses. The 
probability 0.038 406 gives directly what is frequently referred to as a 
“one-tail” test. It tells us that, if a large number of samples of 16 are 
taken, the probability that the sample will contain 12 or more men is
0.0384. The implication in this example is that men are more likely to
contract the disease than women, and the hypothesis to be tested is that,
in the population sampled, men are not more frequent than women. Setting
up our level of significance at the 5% point, we decide that the data tend
to disprove the hypothesis. To test the validity of the reasoning in this
problem and all problems wherein we have to decide between a one-tail
and a two-tail test, it is useful to imagine the consequences of dealing
with a large number of similar samples. It is clear that only 3.8% of the
samples will have more men than the one corresponding to the 5% point;
all others will have less and will be in agreement with the hypothesis.
The samples that agree with the hypothesis will include those having 0
men and 16 women, 1 man and 15 women, and others which show a
definite tendency in the opposite direction, but we are not testing for such
a tendency and hence these samples are regarded as merely agreeing with
the hypothesis.

The other hypothesis to be tested is that there is an equal sex ratio
with respect to the incidence of the disease. The hypothesis is that men
and women occur in equal numbers in the population sampled. On this
basis we must consider both tails of the distribution because extremes in
either direction disagree with the hypothesis. Our probability would be
0.0384 × 2 = 0.0768, and taking the level of significance at the 5% point
we would have to conclude that the data were not violently in disagreement
with the hypothesis, but open to question.
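The one-tail and two-tail probabilities of Example 15-1 can be checked directly from the binomial terms. The following is a minimal sketch in Python, assuming only the standard library; the function names are illustrative and do not come from the text.

```python
from math import comb

def binomial_terms(n, p):
    """Probability of exactly k men, k = 0..n, in a sample of n."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def upper_tail(n, p, k_min):
    """Probability of k_min or more men in a sample of n."""
    return sum(binomial_terms(n, p)[k_min:])

terms = binomial_terms(16, 0.5)
one_tail = upper_tail(16, 0.5, 12)   # ratios 12:4, 13:3, ..., 16:0
two_tail = 2 * one_tail              # symmetrical binomial: double the tail

# Reproduce the three columns of the table in Example 15-1.
for men in range(16, 11, -1):
    print(men, f"{terms[men]:.6f}", f"{upper_tail(16, 0.5, men):.6f}")
print(f"one-tail P = {one_tail:.4f}, two-tail P = {two_tail:.4f}")
```

Doubling the tail is valid here only because the binomial is symmetrical ($p_1 = p_2$).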

The question of which type of test to make is one that has caused
considerable confusion, mainly on account of a lack of a clear understanding
of the hypothesis to be tested. Suppose, for example, that we decide
to test the first of the two hypotheses mentioned. Then we proceed
to investigate a series of such samples, and to those that contain a
preponderance of women we apply the same test in the opposite sense.
Clearly we are utilizing for each test only one-half of the binomial
distribution and as a result our level of significance should be set at the
2.5% point rather than at the 5% point.

Tests of goodness of fit when the classification is twofold can always 
be made by means of the binomial distribution, but we are confined in 
the exact method that has been outlined to examples in which the numbers 
are small. With large numbers the computations are onerous unless 
there are only a very few probabilities to compute at one tail of the 
distribution, and in that event the result would be obviously significant 
and the calculation of probabilities unnecessary. 

With binomial distributions for which $p_1 \neq p_2$ and which are therefore
skewed, the procedure is the same as given for the symmetrical binomial,
but for a two-tailed test we have to calculate probabilities at both tails.
For example, with $n = 20$, $p_1 = 3/4$, $p_2 = 1/4$, the theoretical ratio would
be 15:5. For a sample giving a ratio of 18:2, the two-tailed test would
involve the probabilities for the ratios 20:0, 19:1, 18:2 at one end,
and 0:20, 1:19, ..., 12:8 at the other end. This would be a test of
the probability of a deviation from the theoretical as great as or greater
than $\pm 3$.
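For the skewed case the two tails must be summed separately. The sketch below, in Python with illustrative names that are not from the text, collects every ratio whose deviation from the theoretical is as great as or greater than the observed deviation, assuming the observed deviation is not zero.

```python
from math import comb

def pmf(n, p, k):
    """Probability of exactly k individuals of the first class."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_tail_skewed(n, p, observed):
    """Return (upper tail, lower tail, sum) for deviations >= the observed one."""
    expected = n * p
    deviation = abs(observed - expected)
    upper = sum(pmf(n, p, k) for k in range(n + 1) if k - expected >= deviation)
    lower = sum(pmf(n, p, k) for k in range(n + 1) if expected - k >= deviation)
    return upper, lower, upper + lower

# n = 20, p1 = 3/4, observed ratio 18:2; tails are 18..20 and 0..12.
upper, lower, both = two_tail_skewed(20, 0.75, 18)
print(f"upper = {upper:.4f}, lower = {lower:.4f}, two-tail = {both:.4f}")
```

The two tails are unequal, which is why the one-tail value cannot simply be doubled as in the symmetrical case.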

5. Goodness of fit for twofold classifications—numbers moderate to large. 
When the numbers are moderate to large, the most convenient goodness
of fit test arises from the fact that the binomial distribution approximates
to the normal distribution. This is true regardless of the inequality of
$p_1$ and $p_2$; that is, as the numbers increase the binomial distribution
approaches the normal even if $p_1 \neq p_2$, but if $p_1 = p_2$ the approach to
the normal is more rapid and holds quite well for goodness of fit tests
even when the numbers are small.

The approach of the binomial to the normal distribution for the case
$p_1 \neq p_2$ can be gauged quite accurately by means of $\alpha_3$, the measure of
skewness discussed in Chapter 3, Section 8. We have

$$\alpha_3 = \frac{p_2 - p_1}{\sqrt{n p_1 p_2}} \tag{1}$$

and it is clear that if $p_1 = p_2$ the skewness is zero. If $p_1 \neq p_2$, however,
the value of $\alpha_3$ decreases fairly rapidly with increasing values of $n$.

To make a goodness of fit test we calculate first the theoretical ratio
$t_1 : t_2$ and determine $|a_1 - t_1|$ or $|a_2 - t_2|$, where $a_1$, $a_2$ are the
actual frequencies corresponding to the theoretical frequencies. Then we
obtain

$$s = \sqrt{n p_1 p_2} \tag{2}$$

and

$$d = \frac{|a_1 - t_1|}{s} \tag{3}$$

which may be defined as the relative deviate. The significance of $d$ is
determined by entering the table of the probability integral and finding
the area of the normal curve beyond that point. If the table gives
$\tfrac12(1 + \alpha)$ we find $1 - \tfrac12(1 + \alpha)$ for a one-tail test and
$2[1 - \tfrac12(1 + \alpha)]$ for a two-tail test.

Figure 15-1. Frequency distribution of $(1/2 + 1/2)^8$ and corresponding
smooth curve. Shaded areas indicate the need for a correction
for small samples.

Thus far, however, the assumption is of a closer agreement between the
histogram of the binomial distribution and the smooth normal curve than
is actually the case. Reference to Figure 15-1 illustrates this point. It
represents the distribution obtained on expanding the binomial $(1/2 + 1/2)^8$.
The probability of 6 or more successes in a trial of 8 events would be
given by the ratio of the dotted area to that of the whole. If a test based
on the smooth normal curve is made, the probability would be the ratio
of the cross-hatched area to the whole. This is evidently less than the
dotted area by an amount equal approximately to one-half the area of
the 6:2 ratio column. Consequently the calculated probability will be
too low. It is also clear from the figure that an approximate correction
for the discrepancy can be made by calculating the required probability
from the area underneath the smooth curve to the right of the line
bordering the 5:3 and 6:2 columns. This will be given as indicated in
the figure by taking $|a_1 - t_1| - 1/2$ as the measure of the deviation from
the theoretical. The working formula for the relative deviate will be

$$d = \frac{|a_1 - t_1| - \tfrac12}{s} \tag{4}$$

The work may be shortened slightly by writing equation (4) in the form

$$d = \frac{|a_1 - r a_2| - \dfrac{r + 1}{2}}{\sqrt{r(a_1 + a_2)}} \tag{5}$$

where $r = t_1/t_2 = p_1/p_2$.

As the numbers become large there is less necessity for the correction,
but in any event there will be a slight improvement, and consequently the
making of the correction should be a routine procedure. The only
exception to this will be demonstrated in connection with a method to be
described later wherein the discrepancies for a series of ratios are
summated and wherein the correction should not be made (see Section 9).

With the method just described for testing goodness of fit it is also
possible to set up fiducial limits. From equation (4) and taking
$d_{.05} = 1.96$ from tables of the probability integral of the normal curve,
we have

$$|a_1 - t_1| = d_{.05}\,s + \tfrac12$$

Then $l_1$ and $l_2$ for $a_1$ are

$$l_1 = t_1 + d_{.05}\,s + \tfrac12$$
$$l_2 = t_1 - d_{.05}\,s - \tfrac12$$

The methods described for testing goodness of fit and setting up fiducial 
limits for a twofold classification where the numbers are moderate to 
large are applied in Example 15-2. 

6. Example 15-2. Goodness of fit and fiducial limits for a twofold table 
— numbers moderate to large. The family of wheat plants referred to in 
the test above gave 107 resistant and 25 susceptible plants. This is to 
provide a test of the hypothesis that the genetic factor basis is such as to 
give a 3:1 ratio. Therefore $n = 132$, $p_1 = 0.75$, and $p_2 = 0.25$. Then

$$s = \sqrt{132 \times 0.75 \times 0.25} = 4.975$$

$$d = \frac{|107 - 99| - \tfrac12}{4.975} = \frac{7.5}{4.975} = 1.51$$

Entering the tables of the probability integral, we find that 1.51 corresponds
to $\alpha/2 = 0.4345$; therefore $P = 2 \times (0.5 - 0.4345) = 0.1310$. Although
the fit is not particularly good, there is not sufficient information to
disprove the hypothesis that the inheritance of rust resistance is based on
a single pair of genes.

To obtain fiducial limits at the 5% point we look up $d_{.05}$ in the tables
corresponding to P = 0.05. The value is 1.96. Then

$$l_1 = 107 + 1.96 \times 4.975 + \tfrac12 = 117$$
$$l_2 = 107 - 1.96 \times 4.975 - \tfrac12 = 97$$

The approximate ratios for the fiducial limits are 117:15 and 97:35.
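The arithmetic of Example 15-2 can be retraced with a few lines of Python. This is a sketch under stated assumptions: the normal tail area is taken from the standard library's `erfc` rather than from a printed table, the function names are illustrative, and the limits are centred on the observed 107 exactly as in the worked example.

```python
from math import sqrt, erfc

def relative_deviate(a1, n, p1, corrected=True):
    """Relative deviate d of equations (3) and (4), and the standard deviation s."""
    t1 = n * p1                              # theoretical frequency
    s = sqrt(n * p1 * (1 - p1))              # equation (2)
    deviation = abs(a1 - t1) - (0.5 if corrected else 0.0)
    return deviation / s, s

def two_tail_p(d):
    """Area of the normal curve beyond +d and -d."""
    return erfc(d / sqrt(2))

d, s = relative_deviate(107, 132, 0.75)
p = two_tail_p(d)
upper_limit = 107 + 1.96 * s + 0.5
lower_limit = 107 - 1.96 * s - 0.5
print(f"s = {s:.3f}, d = {d:.2f}, P = {p:.4f}")
print(f"limits: {upper_limit:.0f} and {lower_limit:.0f}")
```

Because the deviate is not rounded to 1.51 before the tail area is found, the computed P differs slightly in the third decimal from the tabular value 0.1310.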

7. Goodness of fit—2 or more classes. For a large proportion of the 
problems involving goodness of fit the number of classes is greater than 2. 
For example in a genetic problem involving 2 pairs of genes the theoretical 
ratio may be 9:3:3:1. In comparing the goodness of fit of an actual 
frequency distribution to the theoretical normal the number of classes 
may be as large as 20. For the general case we assume $k$
classes for which the corresponding probabilities are $p_1, p_2, p_3, \ldots, p_k$
and the total frequency is $n$. On this basis it is seen that the binomial
distribution with only 2 classes is merely a special case. Just as the
theoretical frequencies when we have only 2 classes are given by the
expansion of the binomial $(p_1 + p_2)^n$, the theoretical frequencies for the
general case are given by the terms of the expansion of the multinomial

$$(p_1 + p_2 + p_3 + \cdots + p_k)^n$$

The general term is

$$\frac{n!}{n_1!\,n_2!\cdots n_k!}\; p_1^{n_1}\, p_2^{n_2} \cdots p_k^{n_k} \tag{7}$$


For example, if we have a population of black, white, and red balls in
the ratio 1:2:1 we can set up

$p_1$ (probability of drawing a black ball) = 1/4
$p_2$ (probability of drawing a white ball) = 1/2
$p_3$ (probability of drawing a red ball) = 1/4

Then, on drawing a sample of 8 balls, the probability of getting 1 black,
3 white, and 4 red is

$$\frac{8!}{1!\,3!\,4!}\left(\frac14\right)^{1}\left(\frac12\right)^{3}\left(\frac14\right)^{4} = 0.03418$$

For a particular problem with more than 2 classes the multinomial
expansion enables us to determine the probability of each of the possible
combinations as we did by means of the binomial expansion for 2 classes.
However, one of the difficulties in making tests of goodness of fit is to
classify the different combinations with respect to the degree of divergence
from the theoretical. The binomial problem was simple because we could
represent the theoretical and actual frequencies by means of a 2-dimensional
graph. With 3 classes we require a 3-dimensional graph and for
more than 3 classes the problem is one of N-dimensional geometry. This
problem can best be visualized by setting up the possible combinations
for a problem in which $n$ is small and there are only 3 classes. Suppose
that $n = 6$ and $p_1 = p_2 = p_3 = 1/3$. The possible combinations and the
corresponding probabilities are worked out below. For convenience the
classes are represented by A, B, C.


Combinations (A B C)                         Probability (each)
600  060  006                                0.001 372
510  501  051  150  105  015                 0.008 230
420  402  042  240  024  204                 0.020 576
330  303  033                                0.027 435
411  141  114                                0.041 152
321  312  132  231  123  213                 0.082 305
222                                          0.123 457
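Both the ball example and the table of combinations follow from the general term of the multinomial, equation (7). The following Python sketch, with illustrative function names and exact fractions from the standard library, evaluates that term and enumerates every combination for $n = 6$ and 3 equally probable classes.

```python
from fractions import Fraction
from math import factorial

def multinomial_probability(counts, probs):
    """General term n!/(n1! n2! ... nk!) * p1^n1 * ... * pk^nk as an exact fraction."""
    coefficient = factorial(sum(counts))
    for c in counts:
        coefficient //= factorial(c)         # each division is exact
    probability = Fraction(coefficient)
    for c, p in zip(counts, probs):
        probability *= Fraction(p) ** c
    return probability

def all_combinations(n, k):
    """Every way of distributing n individuals over k classes."""
    if k == 1:
        yield (n,)
        return
    for first in range(n, -1, -1):
        for rest in all_combinations(n - first, k - 1):
            yield (first,) + rest

# 1 black, 3 white, 4 red from a 1:2:1 population.
balls = multinomial_probability(
    (1, 3, 4), (Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)))
print(f"{float(balls):.5f}")

# All combinations for n = 6, p1 = p2 = p3 = 1/3.
third = Fraction(1, 3)
table = {combo: multinomial_probability(combo, (third, third, third))
         for combo in all_combinations(6, 3)}
for combo, prob in table.items():
    print(combo, f"{float(prob):.6f}")
```

Even with only 3 classes and $n = 6$ there are 28 combinations, which suggests how quickly an exact treatment becomes laborious as $n$ and the number of classes grow.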


The theoretical ratio being 2:2:2 for a sample of 6 drawn from a population
for which $p_1 = p_2 = p_3 = 1/3$, a sample such as 1:4:1 would
represent deviations of $-1$, $+2$, and $-1$, and to locate such a sample
on a graph would require a 3-dimensional figure taking one axis for A,
one for B, and one for C. The selection of the samples having a given
divergence from the theoretical is a problem therefore of defining a region
within which all samples would agree with the hypothesis and outside of
which all samples would disagree. This is not a simple problem, and it is
unnecessary to go into it in detail here. For an excellent discussion of
this point the student is referred to Smith and Duncan [5], Chapter XIII.
However, even assuming a simple solution to the problem of defining
regions of rejection and acceptance of the hypothesis, the computations
would be tremendous for appreciable values of $n$ and $k$.

The solution of our difficulty came with the discovery of the $\chi^2$ (chi
square) distribution by Karl Pearson [4]. Its application to general
problems involving various numbers of classes and with varying linear
restrictions on the calculation of the theoretical frequencies was clarified
by R. A. Fisher [2] who pointed out the necessity of taking the degrees
of freedom into consideration. Pearson derived the theoretical distribution
for the statistic $\chi^2$ given by

$$\chi^2 = \sum \frac{(n_i - n p_i)^2}{n p_i} \tag{8}$$

where $n_i$ represents the numbers in the $1, 2, 3, \ldots, i, \ldots, k$ classes for
which the probabilities are $p_1, p_2, p_3, \ldots, p_i, \ldots, p_k$. Thus the values
of $n p_i$ are the theoretical and $n_i$ are the actual frequencies. In general
we shall take $a_1, a_2, \ldots, a_i, \ldots, a_k$ to represent the actual frequencies,
and $t_1, t_2, \ldots, t_i, \ldots, t_k$ the theoretical frequencies,* so that

$$\chi^2 = \sum \frac{(a_i - t_i)^2}{t_i} \tag{9}$$

By means of the statistic $\chi^2$ the problem of testing goodness of fit with
more than 2 classes is very much simplified since the distribution can be
represented in the usual manner in 2 dimensions.

In the derivation of $\chi^2$ from the multinomial distribution the necessity
arises of substituting "Stirling's approximation" for factorial expressions.
This approximation holds only when the factorials are large; hence the
$\chi^2$ distribution is not accurate when the theoretical frequencies are small.
The general rule is not to have theoretical frequencies fewer than 5.
This is accomplished by keeping the numbers dealt with as large as
possible, but should low theoretical frequencies occur in any event it is
desirable to rearrange the table by combining classes or, in certain cases,
by eliminating the low frequency classes entirely. A full discussion of
this point is given by Cochran [1] wherein it will be noted that setting the
minimum for a theoretical frequency at 5 is being quite conservative.
Cochran also describes methods of obtaining a reliable test when not more
than one of the theoretical frequencies is quite small.

8. The $\chi^2$ test for goodness of fit—2 classes. The test described in
Section 5 for a twofold classification is fully adequate and gives the same
result as the $\chi^2$ test. However, there are certain practical and theoretical
advantages in the application of $\chi^2$ that will be brought out as we proceed.


* $t$ will always be represented with a subscript and therefore is not to be confused
with the statistic $t$.

$\chi^2$ for 2 classes is equal to the square of $d$ of Section 5. Therefore we
have

$$\chi^2 = \frac{(a_1 - r a_2)^2}{r(a_1 + a_2)} \tag{10}$$

As in calculating $d$, it is necessary to apply the correction for continuity;
hence the corrected $\chi_c^2$ is given by

$$\chi_c^2 = \frac{\left[\,|a_1 - r a_2| - \dfrac{r + 1}{2}\right]^2}{r(a_1 + a_2)} \tag{11}$$

The necessity of making this correction was first pointed out by Yates
[7]. In this paper Yates discusses the need for the correction not only
to a $\chi^2$ based on 1 degree of freedom but to a $\chi^2$ based on 2 or more degrees
of freedom. A further discussion of the need and methods of applying
the correction is given by Cochran [1].

For the simple twofold table there is only 1 degree of freedom for the
estimation of $\chi^2$. This follows from the fact that the total is fixed, i.e.,
we are concerned with the possible variation in samples of size $(a_1 + a_2)$.
In such samples only one of the frequencies can be varied at will; hence
we have only 1 degree of freedom.

For purposes of illustration we shall apply the $\chi^2$ test to the data of
Example 15-2. Here we have $a_1 = 107$, $a_2 = 25$, and $r = 3$. Then

$$\chi_c^2 = \frac{\left[\,|107 - 3 \times 25| - 2\right]^2}{3 \times 132} = \frac{900}{396} = 2.27$$

In order to determine the corresponding probability we enter Table A-4
under $n = 1$ and note that P is approximately half-way between 0.20
and 0.10. The actual value as determined in Example 15-2 is 0.13.

It should be noted that the practical advantage here in calculating $\chi^2$ as
compared to calculating $d$ is that it is not necessary to extract a square root.
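The equality of the corrected $\chi^2$ of equation (11) and the square of the relative deviate of equation (4) can be verified numerically. The sketch below is in Python with illustrative names; the probability for 1 degree of freedom is taken from the standard library's `erfc`, which is an assumption standing in for Table A-4.

```python
from math import sqrt, erfc

def chi2_two_class(a1, a2, r, corrected=True):
    """Equation (10), or equation (11) when the continuity correction is applied."""
    deviation = abs(a1 - r * a2)
    if corrected:
        deviation -= (r + 1) / 2
    return deviation**2 / (r * (a1 + a2))

def p_one_df(chi2):
    """Probability of a larger chi-square with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2))

chi2 = chi2_two_class(107, 25, 3)

# Square of the corrected relative deviate d from Example 15-2.
d_squared = ((abs(107 - 99) - 0.5) / sqrt(132 * 0.75 * 0.25)) ** 2

print(f"chi-square = {chi2:.4f}, d squared = {d_squared:.4f}, P = {p_one_df(chi2):.4f}")
```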

9. The $\chi^2$ test of heterogeneity and the addition theorem. An important
property of the $\chi^2$ distribution is that the sum of a series of independently
determined values is itself distributed as $\chi^2$ with degrees of freedom
obtained from the sum of the degrees of freedom for the independent
components. Thus, for a series of $k$ samples,

Sample    1         2         ...   i         ...   k         Total
          $a_{11}$  $a_{21}$  ...   $a_{i1}$  ...   $a_{k1}$  $A_1$
          $a_{12}$  $a_{22}$  ...   $a_{i2}$  ...   $a_{k2}$  $A_2$

each of these will provide a value of $\chi^2$ estimated with 1 degree of freedom.
Similarly we have, from the totals, the pooled value

$$\chi_p^2 = \frac{(A_1 - r A_2)^2}{r(A_1 + A_2)} \tag{12}$$

also estimated with 1 degree of freedom, and the total

$$\chi_t^2 = \sum(\chi^2) \tag{13}$$

representing $k$ degrees of freedom. We can now write

$$\chi_h^2 = \chi_t^2 - \chi_p^2 \tag{14}$$

and the corresponding equation for degrees of freedom is

$$(k - 1) = k - 1 \tag{15}$$

The value of $\chi_h^2$ will be related to the variation in the ratios $a_{i1} : a_{i2}$.
If these are all alike, $\chi_h^2 = 0$. If they are widely different, $\chi_h^2$ becomes
large. It is a measure of heterogeneity and is frequently referred to as
the "heterogeneity" $\chi^2$.

In connection with the test for heterogeneity the correction for continuity
should not be made in obtaining the values of $\chi^2$ that are summated
(Cochran [1]). If a test of significance is to be applied to individual
ratios or to the pooled $\chi^2$, it is desirable to make the correction, but the
addition theorem does not hold for corrected values.

Example 15-3 illustrates the points brought out in this section and also
a convenient method of calculating the heterogeneity $\chi^2$, given by
Snedecor [6].

Since the distribution of $\chi^2$ is dependent on the number of degrees of
freedom, in theory we require a different probability table for each degree
of freedom from 1 to $\infty$. The table prepared by Fisher [2] (Table A-4)
overcomes this difficulty for practical purposes by giving the values of
$\chi^2$ for a series of different probability levels. Thus for 2 degrees of
freedom the $\chi^2$ and corresponding values of P for a few levels are

P          0.99   0.95   0.90   0.50   0.10   0.05   0.01   0.001
$\chi^2$   0.02   0.10   0.21   1.39   4.60   5.99   9.21   13.82

For a particular problem, if we get a $\chi^2$ of 7.21 we know that P is between
0.05 and 0.01, and this is sufficiently accurate for ordinary tests of
significance.
The tables are made up for a range of degrees of freedom from 1 to 30.
Beyond 30, we can make use of the fact that $\sqrt{2\chi^2}$ is approximately
normally distributed about a mean of $\sqrt{2n - 1}$ with unit standard
deviation, where $n$ represents the degrees of freedom. This means that
the relative deviate for entering the tables of the normal curve (Table A-1)
is $\sqrt{2\chi^2} - \sqrt{2n - 1}$.
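For 2 degrees of freedom the probability of a larger $\chi^2$ has the simple closed form $P = e^{-\chi^2/2}$, which allows the small table above to be checked without Table A-4. The Python sketch below uses that fact together with the large-sample deviate just described; the function names and the figures $\chi^2 = 50$ with 40 degrees of freedom are hypothetical, chosen only for illustration.

```python
from math import exp, log, sqrt, erfc

def p_two_df(chi2):
    """Probability of a larger chi-square with 2 degrees of freedom."""
    return exp(-chi2 / 2)

def chi2_two_df(p):
    """Value of chi-square exceeded with probability p, 2 degrees of freedom."""
    return -2 * log(p)

def large_df_deviate(chi2, df):
    """sqrt(2 chi^2) - sqrt(2n - 1), for use beyond 30 degrees of freedom."""
    return sqrt(2 * chi2) - sqrt(2 * df - 1)

def upper_normal_tail(z):
    """One-tail area of the normal curve beyond z."""
    return 0.5 * erfc(z / sqrt(2))

for p in (0.99, 0.95, 0.90, 0.50, 0.05, 0.01, 0.001):
    print(p, f"{chi2_two_df(p):.2f}")
print(f"P for chi-square 7.21, 2 DF: {p_two_df(7.21):.4f}")

# Hypothetical illustration of the approximation: chi-square = 50 with 40 DF.
z = large_df_deviate(50, 40)
print(f"deviate = {z:.3f}, approximate P = {upper_normal_tail(z):.3f}")
```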

10. Example 15-3. Goodness of fit for sex ratios with multiple classification.
The data in Table 15-1 are numbers of male and female sawflies
(Cephus sp.) emerging from wheat stubble. Counts were made on a
number of varieties at several stations, and the data given are for the
first 4 varieties at 2 stations.


TABLE 15-1*

FREQUENCY OF MALE AND FEMALE SAWFLIES EMERGING FROM WHEAT STUBBLE

            Swift Current       Scott               Both Stations
Variety     Male    Female      Male    Female      Male    Female
1           574     369         503     459         1077    828
2           80      105         54      132         134     237
3           86      60          107     81          193     141
4           109     95          100     85          209     180
Total       849     629         764     757         1613    1386


We shall inquire first into the goodness of fit of the individual ratios to
the theoretical 1:1 which is believed to be the situation existing in the
normal wild population. The data indicate that the varieties have an
effect on the sex ratio, and evidence of this can be obtained by a study of
the heterogeneity of $\chi^2$. It is convenient to list the values of $\chi^2$
corresponding to the data in Table 15-1. These are given below.


Variety                   Swift Current   Scott       Both Stations   DF
1                         44.565**        2.012       32.546**        1
2                         3.378           32.710**    28.596**        1
3                         4.630*          3.596       8.096**         1
4                         0.961           1.216       2.162           1
Total $\chi^2$            53.534**        39.534**    71.400**        4
Pooled $\chi^2$           32.747**        0.032       17.182**        1
Heterogeneity $\chi^2$    20.787**        39.502**    54.218**        3


* Table 15-1: unpublished data furnished by A. W. Platt.

The $\chi^2$'s that exceed the 5% point are marked with a single asterisk, and
those exceeding the 1% point with a double asterisk. The results for the
heterogeneity $\chi^2$ leave no room for doubt with respect to the differential
effect of the varieties on the sex ratio. At least there is some factor
causing a serious disturbance in the theoretical 1:1 ratio, and this would
appear to be a varietal effect. The results at Scott emphasize the need
of care in interpreting the total $\chi^2$. This total exceeds the 1% point,
but we note that the greater portion of this is due to variety 2. For the
other 3 varieties the fit to a 1:1 is reasonably good.*

A further test can be made with respect to the effect of environment
on the sex ratio. In each line we can total the $\chi^2$'s for the 2 stations, and
on subtracting the value obtained for the stations combined we have a
measure of heterogeneity due to environment. The results are tabulated
below.

Variety             Total     Both Stations   Heterogeneity
1                   46.577    32.546          14.031**
2                   36.088    28.596          7.492**
3                   8.226     8.096           0.130
4                   2.177     2.162           0.015

DF in each line     2         1               1


For varieties 1 and 2 there is marked evidence of heterogeneity in the 
ratios at the 2 stations. 
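The two tables of $\chi^2$ values in this example follow from equation (10) with $r = 1$ and no correction for continuity, together with the addition theorem of Section 9. The following Python sketch, with illustrative names not taken from the text, recomputes them from the counts of Table 15-1.

```python
def chi2_one_to_one(males, females):
    """Uncorrected chi-square for fit to a 1:1 ratio (equation 10 with r = 1)."""
    return (males - females) ** 2 / (males + females)

# (male, female) counts for varieties 1 to 4, from Table 15-1.
swift_current = [(574, 369), (80, 105), (86, 60), (109, 95)]
scott = [(503, 459), (54, 132), (107, 81), (100, 85)]

def station_analysis(pairs):
    """Individual, total, pooled, and heterogeneity chi-square for one station."""
    individual = [chi2_one_to_one(m, f) for m, f in pairs]
    total = sum(individual)
    pooled = chi2_one_to_one(sum(m for m, _ in pairs), sum(f for _, f in pairs))
    return individual, total, pooled, total - pooled

def environment_heterogeneity(pair_a, pair_b):
    """Sum of the two station values minus the value for the stations combined."""
    separate = chi2_one_to_one(*pair_a) + chi2_one_to_one(*pair_b)
    combined = chi2_one_to_one(pair_a[0] + pair_b[0], pair_a[1] + pair_b[1])
    return separate - combined

for name, pairs in (("Swift Current", swift_current), ("Scott", scott)):
    _, total, pooled, hetero = station_analysis(pairs)
    print(name, f"{total:.3f} {pooled:.3f} {hetero:.3f}")

for variety, (sc, st) in enumerate(zip(swift_current, scott), start=1):
    print(variety, f"{environment_heterogeneity(sc, st):.3f}")
```

Small differences in the last decimal from the printed tables can arise because the tables sum values already rounded to 3 places.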

A convenient calculation procedure for the heterogeneity $\chi^2$ when we
wish to obtain it directly is given by Snedecor [6]. It is outlined below
for the Swift Current data from Table 15-1.

          Number of Flies    Females    Proportion of Females
          N                  a          p
          943                369        0.391 304
          185                105        0.567 568
          146                60         0.410 959
          204                95         0.465 686
Total     1478               629        0.425 575 = $\bar{p}$ = $\Sigma a / \Sigma N$


* Certain points should be noted in the analysis given above for the data of Table
15-1. In the first place it might be argued that a test of the heterogeneity of the ratios
with respect to their fit to a 1:1 ratio is not logical at least for the Swift Current data,
in view of the apparent contradiction of the totals of 849 male to 629 female flies. A
test based on a 1:1 ratio is, however, the test having the most specific meaning here,
as there is a definite expectation of equal numbers of males and females. In addition,
it should be noted that having established heterogeneity a more correct test of the
pooled $\chi^2$ is one based on its relation to the heterogeneity $\chi^2$. The latter can be
converted into a variance based on 3 degrees of freedom and a test carried out as in the
analysis of variance. Here, 20.787/3 = 6.929. Then F = 32.747/6.929 = 4.73, which
is not significant for $n_1 = 1$ and $n_2 = 3$ degrees of freedom.

We calculate

$$\Sigma(ap) = 272.8835$$
$$(\Sigma a)\bar{p} = 267.6867$$
$$\Sigma(ap) - (\Sigma a)\bar{p} = 5.1968$$

Then

$$\chi_h^2 = \frac{\Sigma(ap) - (\Sigma a)\bar{p}}{PQ} = \frac{5.1968}{0.50 \times 0.50} = 20.787$$

where P and Q represent the probabilities 1/2 and 1/2 for the theoretical
1:1 ratio.
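The direct calculation of the heterogeneity $\chi^2$ can be written as a short function. This is a sketch in Python under the notation of the table above (N, a, p); the function name is illustrative and not from the text.

```python
def heterogeneity_chi2(groups, theoretical_p):
    """[sum(a*p) - (sum a) * p_bar] / (P * Q) for groups given as (N, a) pairs."""
    sum_ap = sum(a * (a / n) for n, a in groups)       # sum of a times p = a/N
    total_n = sum(n for n, _ in groups)
    total_a = sum(a for _, a in groups)
    p_bar = total_a / total_n
    return (sum_ap - total_a * p_bar) / (theoretical_p * (1 - theoretical_p))

# Swift Current: (number of flies, number of females) for varieties 1 to 4.
swift_current = [(943, 369), (185, 105), (146, 60), (204, 95)]
result = heterogeneity_chi2(swift_current, 0.5)
print(f"{result:.3f}")
```

The result agrees with the value obtained by subtracting the pooled $\chi^2$ from the total $\chi^2$, since the two routes are algebraically identical when the theoretical ratio is 1:1.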

11. Example 15-4. Goodness of fit to a 9:3:3:1 ratio. In a cross
between parents of the genetic constitution BBcc and bbCC, the $F_2$
sample is classified as follows. The first row represents the actual and
the second row the theoretical frequencies according to a 9:3:3:1 ratio.

               BC         Bc         bC         bc         Total
Actual         1260       625        610        5          2500
Theoretical    1406.25    468.75     468.75     156.25     2500
Deviations     -146.25    +156.25    +141.25    -151.25    0

The actual results differ quite widely from the theoretical, and $\chi^2$ calculated
as follows is quite large.

$$\chi^2 = \frac{146.25^2}{1406.25} + \frac{156.25^2}{468.75} + \frac{141.25^2}{468.75} + \frac{151.25^2}{156.25} = 256.2667$$

On examining the data for the source of the disturbance, we see that the
end classes are deficient and the middle classes too large. This is an
indication of linkage, so we proceed to analyze the data further. Fisher
[2] gives a method that is particularly appropriate here for dividing the
total $\chi^2$ into 3 components, each representing 1 degree of freedom. If
the observed frequencies are $a_1, a_2, a_3, a_4$ the $\chi^2$ components represented
by B, C, and linkage are

$$\text{B:}\qquad \frac{[(a_1 + a_2) - 3(a_3 + a_4)]^2}{3n}$$

$$\text{C:}\qquad \frac{[(a_1 + a_3) - 3(a_2 + a_4)]^2}{3n}$$

$$\text{Linkage:}\qquad \frac{[a_1 - 3a_2 - 3a_3 + 9a_4]^2}{9n}$$

The figures are

B          1600/7500          = 0.2133
C          400/7500           = 0.0533
Linkage    5,760,000/22,500   = 256.0000


It would seem suitable in this example to set up a hypothesis based on
linkage. From the actual frequencies, applying a method given by
Fisher and Balmukand [3], we estimate that there is 9% crossing over
and on this basis calculate a new set of theoretical frequencies and
determine a new value of $\chi^2$. The calculations are given below.

Class    Actual Frequency    Theoretical Frequency    $(a - t)^2/t$
BC       1260                1255                     0.0199
Bc       625                 620                      0.0403
bC       610                 620                      0.1613
bc       5                   5                        0.0000
Total                                                 $\chi^2$ = 0.2215

The theoretical frequencies for this test are obtained by using the figure
of 9% crossing over which is calculated from the sample. This absorbs
1 degree of freedom; therefore we have only 2 degrees of freedom left
for the estimation of $\chi^2$. From Table A-4 we find that the 5% point of
$\chi^2$ for 2 degrees of freedom is 5.99. Therefore there is no question of
disagreement with the hypothesis. In fact the value of P is close to 0.90,
and as close a fit would be expected in only 10% of the trials. This is
not a sufficiently close fit, however, to give us much concern.*
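The figures of Example 15-4 can be reproduced as follows. This is a Python sketch with illustrative names; the probability for 2 degrees of freedom uses the closed form $e^{-\chi^2/2}$ in place of Table A-4, and the theoretical frequencies for the linkage hypothesis are copied from the table above rather than derived from the 9% crossing-over estimate.

```python
from math import exp

def chi2_fit(actual, theoretical):
    """Sum of (a - t)^2 / t over all classes (equation 9)."""
    return sum((a - t) ** 2 / t for a, t in zip(actual, theoretical))

actual = [1260, 625, 610, 5]               # BC, Bc, bC, bc
n = sum(actual)
theoretical = [n * w / 16 for w in (9, 3, 3, 1)]
total = chi2_fit(actual, theoretical)

# Fisher's three components, each with 1 degree of freedom.
a1, a2, a3, a4 = actual
chi_b = ((a1 + a2) - 3 * (a3 + a4)) ** 2 / (3 * n)
chi_c = ((a1 + a3) - 3 * (a2 + a4)) ** 2 / (3 * n)
chi_linkage = (a1 - 3 * a2 - 3 * a3 + 9 * a4) ** 2 / (9 * n)

# Fit to the frequencies expected with 9% crossing over (2 degrees of freedom).
linked = chi2_fit(actual, [1255, 620, 620, 5])
p_linked = exp(-linked / 2)

print(f"total = {total:.4f}")
print(f"B = {chi_b:.4f}, C = {chi_c:.4f}, linkage = {chi_linkage:.4f}")
print(f"linkage hypothesis: chi-square = {linked:.4f}, P = {p_linked:.2f}")
```

The three components add up to the total $\chi^2$, which is a useful check on the arithmetic.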


* Very close fits are generally to be regarded with suspicion as they usually reflect
some unnatural condition either in the design of the experiment or the manner in
which the data are obtained. Note that the ideal good fit gives P = 0.5.


12. Exercises.


1. Test the goodness of fit of observation to theory for the following ratios.
Calculate $\chi^2$ to 3 decimal places for each ratio, making the correction for
continuity, and look up the approximate values of P from Table A-4. Then
determine $\sqrt{\chi^2} = \chi$ for each ratio to 2 decimal places and, entering tables
of the probability integral, Table A-1, determine the corresponding values of P
to 4 decimal places.


          Observed          Theoretical Ratio    Approximate    Exact
Sample    $a_1$    $a_2$    r                    P              P
1         134      36       3                    0.30           0.2892
2         240      120      3                    < 0.01         0.0004
3         76       56       1                    0.10           0.0990
4         240      13       15                   > 0.50         0.5484


2. Find the fiducial limits for the ratios in Exercise 1.
(1) 146:24 and 122:48. (2) 257:103 and 223:137.
(3) 88:44 and 64:68. (4) 248:5 and 232:21.


3. In a certain cross the types represented by BC, Bc, bC, and bc are expected to
occur in a 9:3:3:1 ratio. The observed frequencies were

BC     Bc     bC     bc
102    16     35     7

Determine the goodness of fit, and if the fit is poor analyze the data further to disclose
the source of the discrepancy. See Example 15-4.

Total $\chi^2$ = 9.87. P = 0.02. Components: B $\chi^2$ = 0.133. P = 0.70.
C $\chi^2$ = 9.633. P < 0.01.
Linkage $\chi^2$ = 0.100. P = 0.75.


4. The data given below represent segregations in 3 families of the same cross. Test
the goodness of fit of each family to a 9:3:3:1 ratio and determine the heterogeneity
$\chi^2$.
Family AB Ab aB ab 


1 94 25 26 15 
2575 102 {К 36 5 
3 > SS 6 
Heterogeneity $\chi^2$ = 26.34.

CHAPTER 16


Tests of Independence 


1. The nature of tests of independence. When enumeration data are
classified in two ways, the data may indicate a state of dependence of one
classification on the other. Table 16-1 contains data on 82 strains of
oats which are divided into 2 groups according to the presence or absence
of awns and into 3 groups for yield.


TABLE 16-1

CLASSIFICATION OF 82 STRAINS OF OATS FOR YIELD AND
PRESENCE OR ABSENCE OF AWNS

             Yield Class, Weight in Grams
             151-200    201-250    251-325    Total
Awned        6          7          21         34
Awnless      18         21         9          48
Total        24         28         30         82


On examining the frequencies we
observe a tendency for more of the awned strains to occur in the high- 
yielding classes than the awnless strains. To test the significance of a 
result of this sort we have to find the theoretical frequencies for the 6 cells 
of the table on the assumption that the classes are independent. The 
basis for the calculation of the theoretical frequencies must be the marginal 
totals. In other words we say that, on the assumption of complete 
independence, the frequencies in the cells should be proportional to both 
sets of marginal totals. For example, in the cell containing 6 strains, if
we let $t_{11}$ be the theoretical frequency, then $t_{11} : 24 :: 34 : 82$, giving
$t_{11} = (24 \times 34)/82 = 9.9512$. Similarly for all cells the products of the
corresponding marginal totals, divided by the grand total, give the
theoretical frequencies. Representing the actual frequencies by $a$, we have

$$\chi^2 = \sum\left[\frac{(a - t)^2}{t}\right] \tag{1}$$

It is worth while to inquire closely into the nature of the extremely
interesting and practical test of significance arising from this procedure.
In the first place the marginal totals are accepted as part of the hypothesis,
and consequently we are not testing anything in connection with them.
For a population in which the distribution in the classes is as shown by
the marginal totals and the classes are independent, we are asking what
proportion of a large series of samples of 82 strains each will deviate as
much or more from the theoretical as the one observed. This should
clear up any difficulty that may present itself in connection with the method
of calculating the theoretical frequencies. It is also obvious from these
considerations that the frequencies in the margins together with the grand
total are restrictions on degrees of freedom in the table. To find the
number of degrees of freedom remaining for the estimation of $\chi^2$, all we
have to do is to see how many of the cells of the table can be filled up
arbitrarily. On trying this for Table 16-1 we find that after 2 of the cells
are filled the remainder are fixed, leaving 2 degrees of freedom. In
general in an $r \times c$ table there are $(r - 1)(c - 1)$ degrees of freedom for
the estimation of $\chi^2$ in a test of independence.
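The calculation of theoretical frequencies from the marginal totals, and of $\chi^2$ from them, can be sketched in Python as below. The function names are illustrative; the probability uses the closed form for 2 degrees of freedom, which applies to Table 16-1 but not to tables of other sizes.

```python
from math import exp

def expected_frequencies(table):
    """Row total times column total over grand total, for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

def chi2_independence(table):
    """Chi-square and degrees of freedom for a test of independence."""
    expected = expected_frequencies(table)
    chi2 = sum((a - t) ** 2 / t
               for row_a, row_t in zip(table, expected)
               for a, t in zip(row_a, row_t))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

oats = [[6, 7, 21],      # awned
        [18, 21, 9]]     # awnless
expected = expected_frequencies(oats)
chi2, df = chi2_independence(oats)
p = exp(-chi2 / 2)       # valid for 2 degrees of freedom only

print(f"t11 = {expected[0][0]:.4f}")
print(f"chi-square = {chi2:.3f} with {df} degrees of freedom, P = {p:.5f}")
```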

2. Calculation of $\chi^2$ in $r \times c$ tables when $r$ and $c$ are both > 2. The
generalized table may be represented as follows.

                        Column                                   Total
Row       $a_{11}$   $a_{12}$   $a_{13}$   ...   $a_{1c}$        $R_1$
          $a_{21}$   $a_{22}$   $a_{23}$   ...   $a_{2c}$        $R_2$
          ...
          $a_{r1}$   $a_{r2}$   $a_{r3}$   ...   $a_{rc}$        $R_r$
Total     $C_1$      $C_2$      $C_3$      ...   $C_c$           $G$

The corresponding theoretical frequencies are $t_{11}, t_{12}, \ldots, t_{rc}$, which are
calculated from

$$t_{11} = \frac{R_1 C_1}{G}, \qquad t_{12} = \frac{R_1 C_2}{G}, \qquad \ldots, \qquad t_{rc} = \frac{R_r C_c}{G} \tag{2}$$

Then $\chi^2$ is given as usual by $\Sigma[(a - t)^2/t]$, and the degrees of freedom are
$(r - 1)(c - 1)$.

3. $\chi^2$ in an $r \times 2$ table. The $r \times 2$ table is

              Column
Row       1           2           Total
1         $a_{11}$    $a_{12}$    $R_1$
2         $a_{21}$    $a_{22}$    $R_2$
...
i         $a_{i1}$    $a_{i2}$    $R_i$
...
r         $a_{r1}$    $a_{r2}$    $R_r$
Total     $C_1$       $C_2$       $G$

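The shortest method is that given by Snedecor and Irwin [6]. We
calculate

$$\chi^2 = \frac{G^2}{C_1 C_2}\left[\sum\left(\frac{a_{i1}^2}{R_i}\right) - \frac{C_1^2}{G}\right] \tag{3}$$

wherein we can take either one of the columns for the basis of our
calculations. In other words the same result would be obtained from

$$\chi^2 = \frac{G^2}{C_1 C_2}\left[\sum\left(\frac{a_{i2}^2}{R_i}\right) - \frac{C_2^2}{G}\right] \tag{4}$$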
With a slight variation we may also apply Snedecor's formula as in
Section 15-3. In applying this formula it is convenient to set the table
up as follows.

          $a_{i1}$    $R_i$    $p_i = a_{i1}/R_i$
          $a_{11}$    $R_1$    $p_1$
          $a_{21}$    $R_2$    $p_2$
          ...
          $a_{i1}$    $R_i$    $p_i$
          ...
          $a_{r1}$    $R_r$    $p_r$
Total     $C_1$       $G$

We calculate $\Sigma(a_{i1} p_i)$, $\bar{p} = C_1/G$, and $\bar{q} = 1 - \bar{p}$. Then

$$\chi^2 = \frac{\Sigma(a_{i1} p_i) - C_1 \bar{p}}{\bar{p}\bar{q}} \tag{5}$$

This formula is almost the same as the one in Example 15-3 for
calculating the heterogeneity $\chi^2$, but in that case the denominator PQ came
from the theoretical ratio for the goodness of fit test. It is important to
know this as there may be examples in which we require $\chi^2$ by both
methods and it is only necessary to change the denominator.
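That equation (5) gives the same value as the cell-by-cell sum $\Sigma[(a - t)^2/t]$ can be checked numerically. The Python sketch below does so for the Swift Current counts of Table 15-1, treated here, purely as an illustration, as an $r \times 2$ table of varieties by sex; the function names are not from the text.

```python
def chi2_r_by_2(first_column, row_totals):
    """Equation (5): [sum(a * p) - C1 * p_bar] / (p_bar * q_bar)."""
    c1 = sum(first_column)
    g = sum(row_totals)
    p_bar = c1 / g
    q_bar = 1 - p_bar
    sum_ap = sum(a * (a / r) for a, r in zip(first_column, row_totals))
    return (sum_ap - c1 * p_bar) / (p_bar * q_bar)

def chi2_cell_by_cell(first_column, row_totals):
    """The same test computed as the sum of (a - t)^2 / t over all 2r cells."""
    c1 = sum(first_column)
    g = sum(row_totals)
    c2 = g - c1
    chi2 = 0.0
    for a, r in zip(first_column, row_totals):
        t1, t2 = r * c1 / g, r * c2 / g
        chi2 += (a - t1) ** 2 / t1 + ((r - a) - t2) ** 2 / t2
    return chi2

females = [369, 105, 60, 95]
flies = [943, 185, 146, 204]
short = chi2_r_by_2(females, flies)
direct = chi2_cell_by_cell(females, flies)
print(f"{short:.4f} {direct:.4f}")
```

Only the denominator separates this value from the heterogeneity $\chi^2$ of Example 15-3: here it is $\bar{p}\bar{q}$ estimated from the data, there it was PQ = 0.25 from the theoretical 1:1 ratio.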

4. $\chi^2$ in $2 \times 2$ tables. Representing the $2 \times 2$ table by

             a_11      a_12   |   R_1
             a_21      a_22   |   R_2
             C_1       C_2    |   G

we have $\chi^2$ directly from

$$\chi^2 = \frac{(a_{11}a_{22} - a_{12}a_{21})^2\, G}{C_1 C_2 R_1 R_2} \tag{6}$$

but, just as in testing simple ratios for goodness of fit, there is only 1 degree
of freedom for estimating $\chi^2$ and an improvement is brought about by
making a correction for continuity. The correction for a $2 \times 2$ table in
a test of independence amounts to subtracting 1/2 from each of $a_{11}$ and
$a_{22}$ and adding 1/2 to $a_{12}$ and $a_{21}$, but this applies of course to the condition
$a_{11}a_{22} > a_{12}a_{21}$. When $a_{11}a_{22} < a_{12}a_{21}$, the procedure is reversed. The
easiest method is to incorporate the correction in the formula. For this
we have

$$\chi^2 = \frac{\left(|a_{11}a_{22} - a_{12}a_{21}| - \dfrac{G}{2}\right)^2 G}{C_1 C_2 R_1 R_2} \tag{7}$$

showing that the correction $G/2$ must always reduce the numerical value
of $(a_{11}a_{22} - a_{12}a_{21})$.
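
The two forms of the $2 \times 2$ calculation, with and without the correction for continuity, can be sketched as follows. The function name and the frequencies are hypothetical, chosen only to give round numbers; the guard that stops the corrected difference from going below zero is an added assumption and is not part of the text.

```python
def chi_square_2x2(a11, a12, a21, a22, continuity=True):
    """Chi-square on 1 degree of freedom for a 2 x 2 table.

    With continuity=True the cross-product difference is reduced
    numerically by G/2 before squaring.
    """
    r1, r2 = a11 + a12, a21 + a22        # row totals R_1, R_2
    c1, c2 = a11 + a21, a12 + a22        # column totals C_1, C_2
    g = r1 + r2                          # grand total G
    diff = abs(a11 * a22 - a12 * a21)
    if continuity:
        # Assumed guard: do not let the correction overshoot zero.
        diff = max(diff - g / 2, 0.0)
    return diff ** 2 * g / (c1 * c2 * r1 * r2)


# Hypothetical frequencies (not from the text).
print(chi_square_2x2(10, 20, 20, 10, continuity=False))
print(chi_square_2x2(10, 20, 20, 10, continuity=True))
```

Here the cross-product difference is 300 and $G/2 = 30$, so the corrected value is smaller than the uncorrected one, as the text states it must be.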

5. Exact tests for $2 \times 2$ tables. The discussion of Chapter 15, Section
7, emphasized that the use of the $\chi^2$ method for examples in which we
have only 1 degree of freedom is justified only when the numbers are
quite large, or, if the numbers are limited, when we apply the correction
for continuity. R. A. Fisher [1] has pointed out that when the numbers
are small it is frequently possible to apply an exact test based on the
direct calculation of the probabilities in a $2 \times 2$ table. Representing the
$2 \times 2$ table by

             a_11      a_12   |   R_1
             a_21      a_22   |   R_2
             C_1       C_2    |   G

Fisher has shown that the probability of obtaining this particular com-
bination is

$$\frac{C_1!\, C_2!\, R_1!\, R_2!}{G!}\left(\frac{1}{a_{11}!\, a_{12}!\, a_{21}!\, a_{22}!}\right) \tag{8}$$

and with this in mind it may be comparatively easy to determine the
separate probabilities for all the combinations required to make a test of
significance. The procedure is most easily followed by its application to
actual data as in Example 16-1.

6. Example 16-1. An exact test in a $2 \times 2$ table. In a study of the
blood groups of the North American Indians, Grant [2] obtained the
results given below.

                          Blood Group
                  O       A       B       AB      Total
Fond du Lac      18       6       5        0        29
Chipewyan        13       0       1        0        14
Total            31       6       6        0        43

The essential feature of the data from the standpoint of the possibility of
hybridization with the white race was the classification into 2 main
groups, O and not O. We form therefore a $2 \times 2$ table

                  O      Not O    Total
Fond du Lac      18       11       29
Chipewyan        13        1       14
Total            31       12       43

The various combinations of this table that can occur provided that the
marginal totals remain the same are

     17   12        18   11        19   10        20    9
     14    0        13    1        12    2        11    3

What we actually require in this test is the sum of the probabilities for
the first 2 combinations on the left. It is taken as a test of the presence
of a greater proportion of white race genes in the Indian race from Fond
du Lac than in those from Chipewyan. Any result such as that obtained
in the combinations on the right may be taken merely as lack of evidence
on this point or tending to disprove the hypothesis.
The calculations are

$$\frac{31! \times 12! \times 29! \times 14!}{43!} \times \frac{1}{17! \times 14! \times 12!}$$

$$\frac{31! \times 12! \times 29! \times 14!}{43!} \times \frac{1}{18! \times 11! \times 13!}$$

For convenience in calculation we obtain first the logarithm of the
constant factor which is 31.701 1593. The logs of the 2 terms to be
subtracted are 34.171 8139 and 33.201 7770, giving

     Log term 1 = $\bar{3}$.529 3454        Term 1 = 0.003 383
     Log term 2 = $\bar{2}$.499 3823        Term 2 = 0.031 578

                                            Total  = 0.0350
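
The same two terms can be obtained without logarithms. The sketch below is one possible way of doing so and not the method of the text: it rewrites the factorial expression in terms of binomial coefficients (an equivalent form) and uses Python's `math.comb`. The function names are hypothetical.

```python
from math import comb


def table_probability(a11, a12, a21, a22):
    """Probability of one 2 x 2 table with all marginal totals fixed.

    Equivalent to C1! C2! R1! R2! / (G! a11! a12! a21! a22!), written with
    binomial coefficients so that no large factorial is formed explicitly.
    """
    r1, r2 = a11 + a12, a21 + a22
    c1 = a11 + a21
    return comb(r1, a11) * comb(r2, a21) / comb(r1 + r2, c1)


def one_tailed_exact(a11, a12, a21, a22):
    """Sum the probabilities of the observed table and of every table that
    is more extreme in the direction of smaller a11 and a22."""
    total = 0.0
    while min(a11, a12, a21, a22) >= 0:
        total += table_probability(a11, a12, a21, a22)
        # Next more extreme table with the same marginal totals.
        a11, a12, a21, a22 = a11 - 1, a12 + 1, a21 + 1, a22 - 1
    return total


term_1 = table_probability(17, 12, 14, 0)   # most extreme combination
term_2 = table_probability(18, 11, 13, 1)   # observed combination
print(f"term 1 = {term_1:.6f}, term 2 = {term_2:.6f}")
print(f"total  = {one_tailed_exact(18, 11, 13, 1):.4f}")
```

The printed values should agree with the two terms and the total obtained above from the logarithms.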


It is of interest to calculate $\chi^2$ for this same table in order to note the
agreement with the exact method. The lowest theoretical frequency in
this table is (12 × 14)/43 = 3.91, and, since the general rule for $\chi^2$ is
that the lowest theoretical frequency should be about 5, this table is
evidently just below the requirements. Here we have

$$\chi^2 = \frac{(13 \times 11 - 18 - 43/2)^2 \times 43}{31 \times 12 \times 14 \times 29} = 3.0499$$

An easy method of obtaining the exact probability corresponding to a $\chi^2$
based on 1 degree of freedom is to find

$$\chi = \sqrt{3.0499} = 1.75$$

which is equal to $d$, the relative deviate as described in Chapter 15,
Section 5. Here we have

$$d = 1.75, \qquad \tfrac{1}{2}(1 + \alpha) = 0.9599$$

giving

$$P = 1 - \tfrac{1}{2}(1 + \alpha) = 0.0401$$

This is a reasonably close agreement and indicates that, when the correc-
tion for continuity is made, the $\chi^2$ method can be depended on to give a
fairly close approximation to the true probability even when the lowest
frequency is 4.
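
As a check on the arithmetic, the corrected $\chi^2$ and the corresponding one-tailed normal probability can be computed directly. This is a sketch under one assumption that differs slightly from the text: the deviate $d$ is carried unrounded rather than rounded to 1.75, so the probability may differ from 0.0401 in the fourth decimal place. The normal tail area is taken from the complementary error function `math.erfc`.

```python
from math import erfc, sqrt

# 2 x 2 table of Example 16-1: rows Fond du Lac / Chipewyan, columns O / Not O.
a11, a12, a21, a22 = 18, 11, 13, 1
r1, r2 = a11 + a12, a21 + a22
c1, c2 = a11 + a21, a12 + a22
g = r1 + r2

# Chi-square with the continuity correction G/2.
chi2 = (abs(a11 * a22 - a12 * a21) - g / 2) ** 2 * g / (c1 * c2 * r1 * r2)

# One-tailed probability that a normal deviate exceeds d = sqrt(chi2).
d = sqrt(chi2)
p_one_tailed = 0.5 * erfc(d / sqrt(2))

print(f"chi-square = {chi2:.4f}, d = {d:.3f}, P = {p_one_tailed:.4f}")
```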
7. Example 16-2. Test of independence in an $r \times 2$ table. The data
are those of Table 16-1.

                     R_i            a_i1
                  Number of      Number of
                   Plants       Awned Plants         p_i
                     24              6            0.250 000
Yield class          28              7            0.250 000
                     30             21            0.700 000
                     82          34 = C_1         0.414 634 = p

     $\sum(a_{i1}p_i)$ = 17.9500
     $C_1 p$           = 14.0976

     Difference        =  3.8524

$$q = 1 - 0.414\,634 = 0.585\,366$$

$$\chi^2 = \frac{3.8524}{0.414\,634 \times 0.585\,366} = \frac{3.8524}{0.242\,71} = 15.87$$

The 1% point of $\chi^2$ for 2 degrees of freedom is 9.21; therefore there is
no question as to the significance of dependence in this table, or, in other
words, that there is a significant increase in awned plants with higher yield.
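
The example can be reproduced with a short sketch that computes $\chi^2$ twice: once from the proportions $p_i$, as above, and once from the theoretical frequencies of every cell, as in the general method. The function names are hypothetical; the data are those of the example.

```python
def chi_square_rx2_proportions(a, r):
    """[sum(a_i1 * p_i) - C_1 * p] / (p * q) for an r x 2 table."""
    c1, g = sum(a), sum(r)
    p = c1 / g
    q = 1 - p
    # a_i1 * p_i equals a_i1 squared divided by R_i
    sum_ap = sum(ai * ai / ri for ai, ri in zip(a, r))
    return (sum_ap - c1 * p) / (p * q)


def chi_square_rx2_general(a, r):
    """Sum of (a - t)^2 / t over both columns of the r x 2 table."""
    c1, g = sum(a), sum(r)
    c2 = g - c1
    chi2 = 0.0
    for ai, ri in zip(a, r):
        for observed, col_total in ((ai, c1), (ri - ai, c2)):
            expected = ri * col_total / g
            chi2 += (observed - expected) ** 2 / expected
    return chi2


awned = [6, 7, 21]       # a_i1, number of awned plants
plants = [24, 28, 30]    # R_i, number of plants
print(round(chi_square_rx2_proportions(awned, plants), 2))
print(round(chi_square_rx2_general(awned, plants), 2))
```

Both lines should print the value found in the example, which illustrates that the short method is only a rearrangement of the general one.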


8. Exercises. 

1. Table 16-2 gives the data obtained during an epidemic of cholera (Greenwood
and Yule [3]) on the effectiveness of inoculation as a means of preventing the disease.
Test the hypothesis that inoculation has no preventive effect.

TABLE 16-2

FREQUENCIES OF ATTACKED AND NOT ATTACKED IN INOCULATED
AND NOT INOCULATED GROUPS

                     Not Attacked     Attacked
Inoculated               192              4
Not Inoculated           113             34

$\chi^2$ = 35.81. P < 0.001.


2. Kondra [4] studied the quality of feathering of chickens in an early and a late
strain, obtaining the results given in Table 16-3, where A, B, and C are feather quality
grades, A indicating those of superior quality. Study the dependence of earliness on
feather quality in these data.

TABLE 16-3

QUALITY OF FEATHERING OF EARLY AND LATE STRAINS OF CHICKENS

                       Quality of Tail Feathers
              1940                              1941
(a)       Early     Late            (b)     Early     Late
 A          17       47              A        11       41
 B         111      137              B        74       64
 C          60       17              C        23        6

Quality of Wing Feathers, 1940          Quality of Back Feathers, 1941
(c)       Early     Late            (d)     Early     Late
 A          10       26              A        23       26
 B         108      136              B        69       75
 C          70       39              C        16       10

$\chi^2$ = 40.41 (a); 27.96 (b); 18.73 (c); 1.78 (d).


3. a. From a study of the position of the polar bodies in the ova of the ferret,
Mainland [5] gives the frequencies in the following table.

                          Similar     Different
10μ apart                    5            1
More than 10μ apart          1            6

Test the hypothesis that “similar” polar bodies are not more closely associated in
position than “different” polar bodies. P = 0.025, calculated by the direct method.

b. Twenty-two animals are suffering from a disease, and the severity of the disease
is about the same in each case. In order to test the therapeutic value of a serum, it
is administered to 10 animals, 12 remaining uninoculated as a control. The results are

                    Recovered     Died
Inoculated              7           3
Not inoculated          3           9

Determine the probability of obtaining this result or one more favourable to the
treatment. P = 0.046.

4. The data in Table 16-4 were obtained in a cross between a rust resistant and a
susceptible variety of oats. The $F_3$ families were compared for rust reaction in the
seedling stage and in the field under ordinary epidemic conditions.

TABLE 16-4

CLASSIFICATION OF SEEDLING AND FIELD REACTIONS OF 810 $F_3$ FAMILIES OF OATS

                                   Seedling Reaction
Field Reaction        Resistant    Segregating    Susceptible
Resistant                142            51              3
Segregating               13           404              2
Susceptible                2            17            176

Calculate the value of $\chi^2$ for this table, obtaining the theoretical frequencies to 2 decimal
places of accuracy. Check the values of $a - t$ by noting that their sums are zero in
each row and each column. $\chi^2$ = 1127.94.
CHAPTER 17


The Discriminant Function 


1. The use of multiple measurements to discriminate between two 
groups. It is frequently important in biological work, on examining a 
single individual, or a small sample of individuals, to be able to decide in 
which of 2 groups the individual or small sample belongs. In botanical 
studies the problem might be one of deciding on the species. In plant 
breeding the problem might be to decide whether a plant or plant progeny 
belongs to a high-yielding or a low-yielding group. 

Sometimes decisions can be made on the basis of a single variable, but
more often the 2 groups differ in several variables, each of which gives
some indication as to the group in which the individual should be placed.
The problem of utilizing 2 or more variables, however, is obviously not
simple unless any one is sufficient in itself for discrimination, in which
case it is superfluous to consider more than one. The general problem
is to set up a function of the form

$$Z = \lambda_1 X_1 + \lambda_2 X_2 + \lambda_3 X_3 + \cdots + \lambda_k X_k \tag{1}$$

where $X_1, X_2, \ldots, X_k$ are the variables measured and $\lambda_1, \lambda_2, \ldots, \lambda_k$ are
the corresponding weights. The simplest type of function would be

$$Z = X_1 + X_2 + X_3 + \cdots + X_k \tag{2}$$


which assumes that the variables have the same mean and are of equal 
discriminating value. This is not likely to be the best discriminating 
function, however, as some of the variables may have much more dis- 
criminating power than others and should be weighted accordingly. 

A solution to this problem was given by Fisher [3] in 1936, wherein 
the method was demonstrated for separating species of iris employing the 
variables sepal length, sepal breadth, petal length, and petal width. 
Fisher has shown how to devise the coefficients of equation (1) such 
that, if we were to make an analysis of variance of the Z values, the 
ratio of the variance between species to that within species would be a 
maximum. 

Assuming that the 2 groups to be discriminated are A and B and that
there are $n_a$ sets of measurements in A and $n_b$ in B, the coefficients of the
discriminant function arise from maximizing the ratio

$$G = \frac{(\bar{Z}_a - \bar{Z}_b)^2}{\sum\limits_{1}^{n_a}(Z - \bar{Z}_a)^2 + \sum\limits_{1}^{n_b}(Z - \bar{Z}_b)^2} \tag{3}$$

The numerator of this ratio is the square of the difference between the
means of Z for the 2 groups, and the denominator is the sum of squares
within groups. Fisher shows that maximizing the ratio G yields a set of
simultaneous equations of the following form, assuming 3 variables,
$X_1$, $X_2$, and $X_3$, are available for study.

$$\begin{aligned}
\lambda_1\sum(x_1^2) + \lambda_2\sum(x_1x_2) + \lambda_3\sum(x_1x_3) &= d_1 \\
\lambda_1\sum(x_1x_2) + \lambda_2\sum(x_2^2) + \lambda_3\sum(x_2x_3) &= d_2 \\
\lambda_1\sum(x_1x_3) + \lambda_2\sum(x_2x_3) + \lambda_3\sum(x_3^2) &= d_3
\end{aligned} \tag{4}$$

where $x_1$, $x_2$, and $x_3$ represent deviations from their respective group
means represented by $\bar{X}_a$ and $\bar{X}_b$ and

$$\begin{aligned}
d_1 &= \bar{X}_{1a} - \bar{X}_{1b} \\
d_2 &= \bar{X}_{2a} - \bar{X}_{2b} \\
d_3 &= \bar{X}_{3a} - \bar{X}_{3b}
\end{aligned} \tag{5}$$

The difference $\bar{Z}_a - \bar{Z}_b$ between the means for the groups, represented
usually by D, is

$$D = \lambda_1 d_1 + \lambda_2 d_2 + \lambda_3 d_3 \tag{6}$$

When D has been calculated, a test can be made of the significance of
the discriminant function by means of the analysis of variance, taking

$$\left(\frac{n_a n_b}{n_a + n_b}\right)D^2 = \text{sum of squares between groups for } k \text{ degrees of freedom}$$

where $k$ = the number of characters measured, and

$$D = \text{sum of squares within groups for } n_a + n_b - k - 1 \text{ degrees of freedom} \tag{7}$$

The partitioning of degrees of freedom is similar to that for multiple
regression. The sum of squares for between groups is represented by k
degrees of freedom because k of the $\lambda$ coefficients are calculated from the
data and applied as in equation (6) for estimating D.
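
Equations (4) and (6) can be restated compactly in matrix notation. The notation is not used in the text and is introduced here only as a convenience: let $\mathbf{S}$ be the $k \times k$ matrix of within-group sums of squares and products $\sum(x_i x_j)$, $\boldsymbol{\lambda}$ the column of coefficients, and $\mathbf{d}$ the column of differences between group means. Then

$$\mathbf{S}\boldsymbol{\lambda} = \mathbf{d}, \qquad \boldsymbol{\lambda} = \mathbf{S}^{-1}\mathbf{d}, \qquad D = \boldsymbol{\lambda}'\mathbf{d} = \mathbf{d}'\mathbf{S}^{-1}\mathbf{d}$$

The within-group sum of squares of Z is $\boldsymbol{\lambda}'\mathbf{S}\boldsymbol{\lambda} = \boldsymbol{\lambda}'\mathbf{d} = D$, which is why D itself serves as the sum of squares within groups. The variance ratio of the analysis of variance then reduces to

$$F = \frac{\left(\dfrac{n_a n_b}{n_a + n_b}\right)D^2 / k}{D / (n_a + n_b - k - 1)} = \frac{n_a n_b}{n_a + n_b} \cdot \frac{n_a + n_b - k - 1}{k} \cdot D$$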
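In the actual calculations we may wish to know the correlations between
the variables, in which case it is convenient as shown by Cox and Martin
[1] to carry out the calculations in terms of the correlation coefficients.
Since

$$r_{12} = \frac{\sum(x_1x_2)}{\sqrt{\sum(x_1^2)\sum(x_2^2)}}, \qquad r_{13} = \frac{\sum(x_1x_3)}{\sqrt{\sum(x_1^2)\sum(x_3^2)}}, \qquad r_{23} = \frac{\sum(x_2x_3)}{\sqrt{\sum(x_2^2)\sum(x_3^2)}} \tag{8}$$

it is possible to substitute for $\sum(x_1x_2)$, $\sum(x_1x_3)$, and $\sum(x_2x_3)$ in equations
(4), and on simplifying we have

$$\begin{aligned}
\lambda'_1 + \lambda'_2 r_{12} + \lambda'_3 r_{13} &= d'_1 \\
\lambda'_1 r_{12} + \lambda'_2 + \lambda'_3 r_{23} &= d'_2 \\
\lambda'_1 r_{13} + \lambda'_2 r_{23} + \lambda'_3 &= d'_3
\end{aligned} \tag{9}$$

where

$$\begin{aligned}
\lambda'_1 &= \lambda_1\sqrt{\sum(x_1^2)} & d'_1 &= d_1\big/\sqrt{\sum(x_1^2)} \\
\lambda'_2 &= \lambda_2\sqrt{\sum(x_2^2)} & d'_2 &= d_2\big/\sqrt{\sum(x_2^2)} \\
\lambda'_3 &= \lambda_3\sqrt{\sum(x_3^2)} & d'_3 &= d_3\big/\sqrt{\sum(x_3^2)}
\end{aligned}$$

Finally

$$D' = \lambda'_1 d'_1 + \lambda'_2 d'_2 + \lambda'_3 d'_3 = \lambda_1 d_1 + \lambda_2 d_2 + \lambda_3 d_3 = D \tag{10}$$

The details of calculation are given in Example 17-1 below.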


2. Example 17-1. The discriminant function applied to the differentia- 
tion of soil types. At the Iowa Agricultural Experiment Station, Cox and 
Martin [1] collected 286 samples of soil of which 100 contained the 
organism Azotobacter and 186 did not. Three characteristics of the soil 
were studied: 


     $X_1$ = pH
     $X_2$ = amount of readily available phosphate
     $X_3$ = total nitrogen content

A discriminant function was set up and was found to give a very good
differentiation of the soils. In other words, knowledge of the 3 characters
$X_1$, $X_2$, and $X_3$ could be combined in a discriminant function to give a
very good indication as to whether or not the soil sample studied contained
the organism Azotobacter.
A portion of the data from Cox and Martin is reproduced in Table
17-1 and is used here for a demonstration of the calculations required in
order to set up a discriminant function and test its significance.

TABLE 17-1

DATA FROM COX AND MARTIN ON pH ($X_1$), AVAILABLE PHOSPHATE CONTENT ($X_2$),
AND TOTAL NITROGEN CONTENT ($X_3$) OF 52 SAMPLES OF SOIL

          Group A ($n_a$ = 25)                    Group B ($n_b$ = 27)
        containing Azotobacter                  without Azotobacter
          X_1       X_2       X_3               X_1       X_2       X_3
          6.0        46        24               6.2        49        30
          7.0        35        17               5.6        31        23
          8.4       115        28               5.8        42        22
          5.8        35        17               5.7        42        14
          6.9        55        25               6.2        40        23
          7.8        52        29               6.4        49        18
          7.8        52        29               5.8        31        17
          6.9       208        58               6.4        31        19
          7.0        70        13               5.4        62        26
          6.7        35        16               5.4        42        16
          6.2        21        44               5.7        35        22
          6.9        52        21               5.6        33        24
          8.0        60        58               5.8        24        15
          8.0       156        68               7.3        70        14
          8.0        90        37               6.1        21        21
          6.1        44        21               6.2        36        26
          7.4       207        31               6.7        59        26
          7.4       120        32               5.9        33        21
          8.4        65        43               5.6        25        32
          8.1       237        45               5.8        31        30
          8.3        57        60               6.1        30        24
          7.0        94        43               6.1        21        25
          8.5        86        40               5.7        35        22
          8.4        52        48               5.8        37        24
          7.9       146        52               5.8        28        19
                                                5.4        34        20
                                                5.8      16.9        19

Total   184.9      2196       911             160.6       963       592

Mean   7.3960   87.8400   36.4400            5.9481   35.6667   21.9259
It is convenient to set up the sums of squares, sums of products, and
correlation coefficients in the form given in Table 17-2. Examples of the
calculation of sums of squares and products are

$$\sum(X_1^2) = \sum(X_{1a}^2) + \sum(X_{1b}^2) = 1384.0500 + 959.7000 = 2343.7500$$

$$CT_1 = \frac{(\sum X_{1a})^2}{n_a} + \frac{(\sum X_{1b})^2}{n_b} = \frac{(184.9)^2}{25} + \frac{(160.6)^2}{27} = 2322.7930$$

$$\sum(X_1X_2) = \sum(X_{1a}X_{2a}) + \sum(X_{1b}X_{2b}) = 16{,}620.8000 + 5770.8000 = 22{,}391.6000$$

$$CT_{12} = \frac{\sum X_{1a}\sum X_{2a}}{n_a} + \frac{\sum X_{1b}\sum X_{2b}}{n_b} = \frac{184.9 \times 2196}{25} + \frac{160.6 \times 963}{27} = 21{,}969.6827$$

TABLE 17-2

TABULAR ARRANGEMENT OF SUMS OF SQUARES, SUMS OF PRODUCTS,
AND CORRELATION COEFFICIENTS

$\sum(X_1^2)$         2343.7500    $\sum(X_1X_2)$     22,391.6000    $\sum(X_1X_3)$     10,407.3000
$CT_1$                2322.7930    $CT_{12}$          21,969.6827    $CT_{13}$          10,259.0597
$\sum(x_1^2)$           20.9570    $\sum(x_1x_2)$        421.9173    $\sum(x_1x_3)$        148.2403
$\sqrt{\sum(x_1^2)}$     4.5779    $r_{12}$              0.309 113   $r_{13}$              0.414 946

                                   $\sum(X_2^2)$     316,141.0000    $\sum(X_2X_3)$    110,272.0000
                                   $CT_2$            227,243.6400    $CT_{23}$         101,136.9067
                                   $\sum(x_2^2)$      88,897.3600    $\sum(x_2x_3)$      9,135.0933
                                   $\sqrt{\sum(x_2^2)}$   298.1566   $r_{23}$              0.392 608

                                                                     $\sum(X_3^2)$      52,267.0000
                                                                     $CT_3$             46,176.9881
                                                                     $\sum(x_3^2)$       6,090.0119
                                                                     $\sqrt{\sum(x_3^2)}$    78.0385

The next step is to carry out the calculations shown in Table 17-3. 
For the method, see Chapter 8, Section 7. The calculations lead to the 
Gauss multipliers. It will be remembered that these form a matrix 
inverse to the matrix of the correlation coefficients. 
TABLE 17-3

CALCULATION OF INVERSE MATRIX TO CORRELATION COEFFICIENTS

 1.0          0.309 113    0.414 946   |    1.0          0            0
              1.0          0.392 608   |    0            1.0          0
                           1.0         |    0            0            1.0
 1.0          0.309 113    0.414 946   |    1.0          0            0
 1.0          0.309 113    0.414 946   |    1.0          0            0
              0.904 449    0.264 343   |   −0.309 113    1.0          0
              1.0          0.292 270   |   −0.341 769    1.105 646    0
                           0.750 560   |   −0.324 602   −0.292 270    1.0
                           1.0         |   −0.432 480   −0.389 403    1.332 339
                                            1.246 029   −0.215 369   −0.432 480
                                           −0.215 369    1.219 457   −0.389 403
                                           −0.432 480   −0.389 403    1.332 339

The matrix in the lower right portion of the table is

     $c_{11}$   $c_{12}$   $c_{13}$
     $c_{12}$   $c_{22}$   $c_{23}$
     $c_{13}$   $c_{23}$   $c_{33}$

These can be checked by substituting them in the original equations.
Thus

  (1.0 × 1.246 029) − (0.309 113 × 0.215 369) − (0.414 946 × 0.432 480) = 1.000 000
− (0.309 113 × 0.215 369) + (1.0 × 1.219 457) − (0.392 608 × 0.389 403) = 1.000 000
− (0.414 946 × 0.432 480) − (0.392 608 × 0.389 403) + (1.0 × 1.332 339) = 1.000 000

We now require the values of $d'_1$, $d'_2$, and $d'_3$. These are calculated as
follows.

$$d'_1 = \frac{7.3960 - 5.9481}{4.5779} = 0.316\,280$$

$$d'_2 = \frac{87.8400 - 35.6667}{298.1566} = 0.174\,986$$

$$d'_3 = \frac{36.4400 - 21.9259}{78.0385} = 0.185\,986$$
The coefficients $\lambda'_1$, $\lambda'_2$, and $\lambda'_3$ are obtained by multiplying each line of
the inverse matrix by $d'_1$, $d'_2$, $d'_3$.

$\lambda'_1$ = (0.316 280 × 1.246 029) − (0.174 986 × 0.215 369)
        − (0.185 986 × 0.432 480) = 0.275 972

$\lambda'_2$ = − (0.316 280 × 0.215 369) + (0.174 986 × 1.219 457)
        − (0.185 986 × 0.389 403) = 0.072 847 5

$\lambda'_3$ = − (0.316 280 × 0.432 480) − (0.174 986 × 0.389 403)
        + (0.185 986 × 1.332 339) = 0.042 871 6

Then

$$\lambda_1 = \frac{0.275\,972}{4.5779} = 0.060\,283\,5$$

$$\lambda_2 = \frac{0.072\,847\,5}{298.1566} = 0.000\,244\,326$$

$$\lambda_3 = \frac{0.042\,871\,6}{78.0385} = 0.000\,549\,365$$

The required discriminant function is then

$$Z = 0.060\,28X_1 + 0.000\,244\,3X_2 + 0.000\,549\,4X_3$$

and for convenience we can divide through by 0.000 244 3 on the right,
giving a new equation which we can again put equal to Z because it is
the final discriminant function.

$$Z = 246.7X_1 + X_2 + 2.249X_3$$

This equation shows very clearly the relative values of the 3 characters
in distinguishing the soil types.
A complete analysis requires that we make a test of significance of the
discriminant function. First we find D from equation (10).

D = (0.275 972 × 0.316 280) + (0.072 848 × 0.174 986)
        + (0.042 872 × 0.185 986) = 0.108 005

Then the sum of squares between groups is

$$\left(\frac{n_a n_b}{n_a + n_b}\right)D^2 = \frac{25 \times 27}{52}(0.108\,005)^2 = 0.151\,42$$

Since the sum of squares within groups is equal to D, the analysis of
variance can now be set up.

                       SS        DF        MS          F
Between groups      0.151 42      3     0.050 47     22.4
Within groups       0.108 00     48     0.002 25

which leaves no doubt as to the discriminating power of the function.
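
The whole of Example 17-1 can be checked by solving equations (4) directly, without passing through the correlation coefficients. The sketch below is one way of doing this and is not the method of the text: it starts from the within-group sums of squares and products of Table 17-2 and the group means of Table 17-1, and uses a small Gauss-Jordan routine whose name, like the other variable names, is hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Bring the largest remaining entry of this column to the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


# Within-group sums of squares and products (Table 17-2).
within = [
    [20.9570, 421.9173, 148.2403],
    [421.9173, 88897.3600, 9135.0933],
    [148.2403, 9135.0933, 6090.0119],
]
# Differences between group means, d_i = mean of A minus mean of B (Table 17-1).
d = [7.3960 - 5.9481, 87.8400 - 35.6667, 36.4400 - 21.9259]
n_a, n_b, k = 25, 27, 3

lam = solve_linear_system(within, d)                 # equations (4)
D = sum(l * di for l, di in zip(lam, d))             # equation (6)

ss_between = n_a * n_b / (n_a + n_b) * D ** 2        # k degrees of freedom
ss_within = D                                        # n_a + n_b - k - 1 degrees of freedom
F = (ss_between / k) / (ss_within / (n_a + n_b - k - 1))

print("lambda:", [f"{l:.6g}" for l in lam])
print(f"D = {D:.6f}, SS between = {ss_between:.5f}, F = {F:.1f}")
```

Small differences from the tabulated figures in the last digit are to be expected, since the text carries the correlation coefficients to 6 decimal places only.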

3. Applications of the discriminant function for discriminating between
two groups. It is of course not sufficient in itself to calculate a discriminant
function and determine its significance. We might have, for example, a
case of 3 variables wherein 1 variable alone would give complete dis-
crimination without the assistance of the remaining variables. In such
an example the discriminant function would obviously be significant.
The error would lie actually in the application of a more complicated
function than required by the data.

In other cases the discriminant function may be significant but some
of the variables may not contribute anything of value to the function.
Thus, if the function $Z_1 = \lambda_1X_1 + \lambda_2X_2$ gives a significant discrimination,
it is most likely that the function $Z_2 = \lambda_1X_1 + \lambda_2X_2 + \lambda_3X_3$ will also be
significant even if $X_3$ does not in itself have any discriminating value.

Since in obtaining $Z_1$ and $Z_2$ we absorb 2 and 3 degrees of freedom,
respectively, for the discriminant functions, it should be clear that the
same technique for testing the significance of the additional information
given by $X_3$ as employed in multiple regression analysis (Chapter 8,
Section 14) can be applied. The procedure is to test the significance of
the mean square corresponding to the additional degree of freedom.

4. Discriminant function for plant selection. The application of the
discriminant function for plant selection was first described by Smith [5]
in 1939. Smith points out that, in selecting for characters such as yield,
differences due to genotype are very largely masked by non-heritable
variations such as those due to soil and location. Plant breeders therefore
select plants for yield on the basis of general vigor, number of culms, size
of spike, etc., which they believe are associated with corresponding genes,
but in doing this there is no basis for giving more or less weight to certain
characters depending on the extent to which they really indicate a con-
centration of genes for yield. The method of approach suggested is the
use of a discriminant function that will best indicate the genotypic value
of a plant or line.

As a basis for the development of the required discriminant function
for plant selection it can be assumed that the genotype of a given plant
can be represented by a function of the type

$$y = a_1X'_1 + a_2X'_2 + a_3X'_3 + \cdots \tag{11}$$
where $X'_1, X'_2, X'_3, \ldots$ are the values of the characters $X_1, X_2, X_3, \ldots$
to be expected due to genotype, and $a_1, a_2, a_3, \ldots$ are weights depending
on the economic values of the corresponding characters. For example
it might be worked out that number of tillers contributes twice as
much to yield as size of spike. These characters would therefore be
weighted in the proportion of 2:1. In any event there is no way in
practice of avoiding the use of weights because we do actually assign
weights in arbitrarily deciding on the relative value of the different
characters.

Assuming that the genotype can be represented as in equation (11), the
phenotype can be represented by

$$Y = b_1X_1 + b_2X_2 + b_3X_3 + \cdots \tag{12}$$

and the problem is to derive values of $b$ such that the regression of Y on
y will be a maximum because selection of phenotypes using Y as a
discriminant function will then ensure a maximum concentration of the
desired genes in the plants or lines selected. If the plants or lines are
replicated, and

     $t_{ij}$ represents variances and covariances for plants or lines
     $e_{ij}$ represents variances and covariances for error
     $g_{ij} = t_{ij} - e_{ij}$, being an estimate of the component due to
     genotype

it can be shown that maximizing the regression of Y on y yields the
simultaneous equations (for 3 variables)

$$\begin{aligned}
b_1t_{11} + b_2t_{12} + b_3t_{13} &= A_1 \\
b_1t_{12} + b_2t_{22} + b_3t_{23} &= A_2 \\
b_1t_{13} + b_2t_{23} + b_3t_{33} &= A_3
\end{aligned} \tag{13}$$

where

$$\begin{aligned}
A_1 &= a_1g_{11} + a_2g_{12} + a_3g_{13} \\
A_2 &= a_1g_{12} + a_2g_{22} + a_3g_{23} \\
A_3 &= a_1g_{13} + a_2g_{23} + a_3g_{33}
\end{aligned} \tag{14}$$

The $A$'s are calculated from the data by substitution of the calculated
values of $g_{ij}$ and the assigned values of $a$. These are inserted in equations
(13) which are then solved for $b_1$, $b_2$, and $b_3$.
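
Equations (13) and (14) can be written together in matrix form. The notation is an added convenience and does not appear in the text: let $\mathbf{T}$, $\mathbf{E}$, and $\mathbf{G} = \mathbf{T} - \mathbf{E}$ be the matrices of the $t_{ij}$, $e_{ij}$, and $g_{ij}$, and let $\mathbf{a}$ and $\mathbf{b}$ be the columns of weights and coefficients. Then

$$\mathbf{T}\mathbf{b} = \mathbf{G}\mathbf{a}, \qquad \mathbf{b} = \mathbf{T}^{-1}\mathbf{G}\mathbf{a}$$

In the limiting case where the error variances and covariances are zero, $\mathbf{G} = \mathbf{T}$ and the solution is simply $\mathbf{b} = \mathbf{a}$; the coefficients depart from the assigned weights only to the extent that the characters differ in the share of their variation that is due to genotype.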
5. Assigning weights in equation for genotype. Since it is necessary to
assign weights of some sort, there are in general two courses that are open.

1. Without any basis for distinguishing between the characters with
respect to their proportional economic value, the only course is to weight
them equally. Since the means may differ, equal weighting is obtained
by putting

$$a_1 = \frac{1}{\bar{X}_1}, \qquad a_2 = \frac{1}{\bar{X}_2}, \qquad a_3 = \frac{1}{\bar{X}_3}, \qquad \text{etc.}$$

2. Assuming that there is a fair basis for assessing the actual economic
value of each character either by reasoning or shrewd judgment, we
assign the weights by modifying those given above. Thus, if the relative
values of $X_1$, $X_2$, and $X_3$ are to be in the proportion of 2:1:3, we take

$$a_1 = \frac{2}{\bar{X}_1}, \qquad a_2 = \frac{1}{\bar{X}_2}, \qquad a_3 = \frac{3}{\bar{X}_3}$$

In any event it should be emphasized that the assigning of weights does
not consist of a more arbitrary procedure than that followed in ordinary
selection where the plant breeder merely selects those plants that show
desirable characters. For example, in selecting for yield, he must pay
attention to number of tillers and size of spike. In doing so, he must
either assume these to be of equal importance or arbitrarily weight them,
for example by assuming that number of tillers is of somewhat more
importance than size of spike.


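If it is more convenient the discriminant function can be calculated from the corre-
lation coefficients by following the procedure outlined below.

Let $r_{12} = t_{12}/\sqrt{t_{11}t_{22}}$, $r_{23} = t_{23}/\sqrt{t_{22}t_{33}}$, etc. Then in equations (13) we can substitute
$r_{12}\sqrt{t_{11}t_{22}}$ for $t_{12}$, $r_{23}\sqrt{t_{22}t_{33}}$ for $t_{23}$, etc., giving

$$\begin{aligned}
b_1t_{11} + b_2r_{12}\sqrt{t_{11}t_{22}} + b_3r_{13}\sqrt{t_{11}t_{33}} &= A_1 \\
b_1r_{12}\sqrt{t_{11}t_{22}} + b_2t_{22} + b_3r_{23}\sqrt{t_{22}t_{33}} &= A_2 \\
b_1r_{13}\sqrt{t_{11}t_{33}} + b_2r_{23}\sqrt{t_{22}t_{33}} + b_3t_{33} &= A_3
\end{aligned}$$

Dividing the first equation by $\sqrt{t_{11}}$, the second by $\sqrt{t_{22}}$, and the third by $\sqrt{t_{33}}$ gives

$$\begin{aligned}
b_1\sqrt{t_{11}} + b_2r_{12}\sqrt{t_{22}} + b_3r_{13}\sqrt{t_{33}} &= \frac{A_1}{\sqrt{t_{11}}} \\
b_1r_{12}\sqrt{t_{11}} + b_2\sqrt{t_{22}} + b_3r_{23}\sqrt{t_{33}} &= \frac{A_2}{\sqrt{t_{22}}} \\
b_1r_{13}\sqrt{t_{11}} + b_2r_{23}\sqrt{t_{22}} + b_3\sqrt{t_{33}} &= \frac{A_3}{\sqrt{t_{33}}}
\end{aligned}$$

Then, letting $b'_1 = b_1\sqrt{t_{11}}$, $b'_2 = b_2\sqrt{t_{22}}$, $b'_3 = b_3\sqrt{t_{33}}$, and

$$A'_1 = \frac{A_1}{\sqrt{t_{11}}}, \qquad A'_2 = \frac{A_2}{\sqrt{t_{22}}}, \qquad A'_3 = \frac{A_3}{\sqrt{t_{33}}}$$

the equations can be re-written in the following form.

$$\begin{aligned}
b'_1 + r_{12}b'_2 + r_{13}b'_3 &= A'_1 \\
r_{12}b'_1 + b'_2 + r_{23}b'_3 &= A'_2 \\
r_{13}b'_1 + r_{23}b'_2 + b'_3 &= A'_3
\end{aligned}$$

After calculating the $c'$ matrix reciprocal to the $r$ matrix, we have

     $c'_{11}$   $c'_{12}$   $c'_{13}$
     $c'_{12}$   $c'_{22}$   $c'_{23}$
     $c'_{13}$   $c'_{23}$   $c'_{33}$

The $A$'s are best calculated from

$$\begin{aligned}
A_1 &= a_1g_{11} + a_2g_{12} + a_3g_{13} \\
A_2 &= a_1g_{12} + a_2g_{22} + a_3g_{23} \\
A_3 &= a_1g_{13} + a_2g_{23} + a_3g_{33}
\end{aligned}$$

and converted to $A'_1$, $A'_2$, and $A'_3$ by dividing by $\sqrt{t_{11}}$, $\sqrt{t_{22}}$, and $\sqrt{t_{33}}$. The $b'$ values
can then be obtained from

$$b'_1 = A'_1c'_{11} + A'_2c'_{12} + A'_3c'_{13}, \quad \text{etc.}$$

and converted to $b$ values by dividing by $\sqrt{t_{11}}$, $\sqrt{t_{22}}$, and $\sqrt{t_{33}}$.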


6. Example 17-2. Use of discriminant function for plant selection.
Table 17-4 gives the data obtained with 8 varieties of oats in a 3-replicate
test. Root type is determined according to the scheme described by
Hamilton [4]. The 3 characters measured contribute to lodging resistance,
and the object of the analysis is to obtain a discriminant function that
will discriminate between varieties for this character.

Analyses of variance and covariance are made as follows, using $X_1$ for
the variance example and $X_1X_2$ for the covariance example.

($X_1$)             SS            DF       MS
Replicates        2.400 833        2
Varieties        46.259 583        7     6.608 512 = $t_{11}$
Error             4.799 167       14     0.342 798 = $e_{11}$
                                         6.265 714 = $g_{11}$

($X_1X_2$)          SP            DF       Cov
Replicates       −7.5125           2
Varieties       −14.0042           7    −2.000 600 = $t_{12}$
Error             3.0792          14     0.219 943 = $e_{12}$
                                        −2.220 543 = $g_{12}$
TABLE 17-4

DATA ON ROOT TYPE, HEIGHT, AND DIAMETER OF CULM FOR 8 VARIETIES
IN A 3-REPLICATE TEST

                                                     Diameter
    Root Type       Total       Height      Total    of Culm       Total
     ($X_1$)                    ($X_2$)               ($X_3$)

414 45% 30 ІК6 41. 45 44 130 3:32 194 ЖОЛДОП, 
3:0: ӘП” 737 1987 738^ 40% 42: 7120 3.3 34 32900959 
457 42 4% 132. 42 43 42: 127 ЗА 53:2 5) IS) 
А 3207 DONO 7142. 46, 549157137. о 
SEES T 5401622539 431 ANTS 2:8; 73:1 O 
23] 73:3. 1319- 21127 53 251017553: 157 3.0 ЗЕЛЕ 
8.0 65 61 206 46 46 45 137 27: 239 92:9 EE 
8:3 > 68 60 211 41 42 45 128 30 30 9302 94 


Total 41.8 367 362 114.7 342 356 361 1059 243 252 248 743 


The values of $t_{ij}$, $e_{ij}$, $g_{ij}$, and $r_{ij}$ are conveniently tabulated as in Table
17-5. Note that, for convenience, the correlation coefficients are worked
from the values of $t_{ij}$. Thus

$$r_{12} = \frac{-2.000\,600}{\sqrt{6.608\,512 \times 44.946\,429}} = -0.116\,081$$

TABLE 17-5

TABULATED VALUES OF $t_{ij}$, $e_{ij}$, $g_{ij}$, AND $r_{ij}$

                 $t_{ij}$                                       $e_{ij}$
 6.608 512    −2.000 600    −0.680 298        0.342 798     0.219 943    −0.012 887
              44.946 429    −0.445 833                      2.696 429     0.013 988
                             0.113 7500                                   0.007 3214

                 $g_{ij}$                                       $r_{ij}$
 6.265 714    −2.220 543    −0.667 411        1.000 000    −0.116 081    −0.784 643
              42.250 000    −0.459 821                      1.000 000    −0.197 174
                             0.106 4286                                   1.000 000

We must now decide on the values of $a$. The object in this experiment
is to obtain a discriminant function for the selection of those varieties
having the greatest concentration of genes for resistance to lodging. It
seems most logical here to assume that all 3 characters have equal weight;
therefore

$$a_1 = \frac{1}{4.779}, \qquad a_2 = \frac{1}{44.125}, \qquad a_3 = \frac{1}{3.0958}$$

or

$$a_1 = \frac{44.125}{4.779} = 9.23, \qquad a_2 = \frac{44.125}{44.125} = 1, \qquad a_3 = \frac{44.125}{3.0958} = 14.25$$

The fractions are of no value, so we take $a_1 = 9$, $a_2 = 1$, and $a_3 = -14$.
Note that $a_1$ and $a_2$ are opposite in sign to $a_3$ because of the way the 3
characters vary with lodging resistance.

It is convenient to carry out the remainder of the calculations as in
Table 17-6.


TABLE 17-6

CALCULATION OF DISCRIMINANT FUNCTION FOR PLANT SELECTION,
DATA OF TABLES 17-4 AND 17-5

 1.000 000    −0.116 081    −0.784 643   |    1.0          0            0
               1.000 000    −0.197 174   |    0            1.0          0
                             1.000 000   |    0            0            1.0
 1.0          −0.116 081    −0.784 643   |    1.0          0            0
 1.0          −0.116 081    −0.784 643   |    1.0          0            0
               0.986 525    −0.288 256   |    0.116 081    1.0          0
               1.0          −0.292 193   |    0.117 667    1.013 659    0
                             0.300 109   |    0.818 561    0.292 193    1.0
                             1.0         |    2.727 546    0.973 623    3.332 123

                 $g_{ij}$                                       $c_{ij}$
 6.265 714    −2.220 543    −0.667 411        3.246 322     0.914 636     2.727 546
−2.220 543    42.250 000    −0.459 821        0.914 636     1.298 145     0.973 623
−0.667 411    −0.459 821     0.106 429        2.727 546     0.973 623     3.332 123

 $a$      9         1        −14        $A$          63.5146      28.7026      −7.9565
                                        $\sqrt{t}$     2.5707       6.7042       0.3373
                                        $A'$         24.7071       4.2813     −23.5916
                                        $b'$         19.7759       5.1864      −7.0520
                                        $b$           7.693        0.774      −20.91
                                                     10            1          −27
Note that the first step is to calculate the inverse matrix of the correlation
coefficients. Then the values of $g_{ij}$ and $a_1$, $a_2$, and $a_3$ are inserted in the
table. From these we get

$A_1$ =   9 × 6.265 714 − 1 × 2.220 543 + 14 × 0.667 411 = 63.5146
$A_2$ = − 9 × 2.220 543 + 1 × 42.250 000 + 14 × 0.459 821 = 28.7026
$A_3$ = − 9 × 0.667 411 − 1 × 0.459 821 − 14 × 0.106 429 = − 7.9565

We then enter the square roots of $t_{11}$, $t_{22}$, and $t_{33}$. Using these as divisors
gives $A'_1$, $A'_2$, and $A'_3$. Next we calculate

$b'_1$ = 24.7071 × 3.246 322 + 4.2813 × 0.914 636 − 23.5916 × 2.727 546
      = 19.7759

$b'_2$ = 24.7071 × 0.914 636 + 4.2813 × 1.298 145 − 23.5916 × 0.973 623
      = 5.1864

$b'_3$ = 24.7071 × 2.727 546 + 4.2813 × 0.973 623 − 23.5916 × 3.332 123
      = − 7.0520

Then

$$b_1 = \frac{19.7759}{2.5707} = 7.693$$

$$b_2 = \frac{5.1864}{6.7042} = 0.774$$

$$b_3 = \frac{-7.0520}{0.337\,26} = -20.91$$

The last line of the table shows the coefficients converted by dividing by
$b_2$, and writing down with the elimination of decimals.

Applying the discriminant function to the means of the first and last
varieties in Table 17-4, we have

     10 × 3.87 + 43.3 − 27 × 3.33 = − 7.91

     10 × 7.03 + 42.7 − 27 × 3.03 = 31.19
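
The coefficients of Example 17-2 can also be checked by solving equations (13) directly from the $t_{ij}$ and $g_{ij}$ of Table 17-5, without the intermediate correlation matrix. The sketch below is an alternative route and not the procedure of the text; the elimination routine and all variable names are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


# Variances and covariances for varieties (t_ij) and genotypic estimates (g_ij).
t = [
    [6.608512, -2.000600, -0.680298],
    [-2.000600, 44.946429, -0.445833],
    [-0.680298, -0.445833, 0.1137500],
]
g = [
    [6.265714, -2.220543, -0.667411],
    [-2.220543, 42.250000, -0.459821],
    [-0.667411, -0.459821, 0.1064286],
]
a = [9, 1, -14]                                   # assigned weights

# Equations (14): A_i = sum over j of a_j * g_ij
A = [sum(g_ij * a_j for g_ij, a_j in zip(row, a)) for row in g]
# Equations (13): solve t * b = A
b = solve_linear_system(t, A)
# Express the coefficients relative to b_2 and drop the decimals.
rounded = [round(b_i / b[1]) for b_i in b]

print("A =", [round(x, 4) for x in A])
print("b =", [round(x, 3) for x in b])
print("rounded coefficients:", rounded)
```

The rounded coefficients should reproduce the last line of Table 17-6, so that the selection function applied to the variety means is the one used above.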
TABLE 17-7

VALUES OF EGG CHARACTERS FOR 2 GRADES

             Grade A                      |              Grade B
Yolk      Yolk     Albumen   Albumen      |   Yolk      Yolk     Albumen   Albumen
Shadow    Color    Index     Height       |   Shadow    Color    Index     Height
 X_1       X_2      X_3       X_4         |    X_1       X_2      X_3       X_4
8 13 30 17 12 12 40 
7 11 30 20 9 14 20 
7 14 35 14 10 15 25 
7 14 20 25 13 17 35 
7 13 20 25 10 17 30 
8 16 20 28 10 15 25 
6 13 25 26 10 17 30 
8 16 20 27 12 16 35 
8 14 20 26 10 17 20 
8 16 25 24 12 15 35 
si) 12 20 27 10 14 25 
8 17 20 25 11 15 45 
E 13 25 24 10 16 25 
7 15 15 36 9 15 25 
7 14 20 26 12 16 35 
7 8 15 34 10 15 25 
7 14 20 27 12 16 35 
6 15 15 33 1 14 30 
8 14 20 24 9 13 25 
6 14 20 26 10 16 30 
8 13 20 28 10 17 25 
6 13 25 24 10 14 25 
qi 16 20 31 10 18 30 
6 14 20 26 11 14 30 
8 16 20 28 10 15 30 
9 18 20 

9. 14 20 

8 16 25 

10 15 25 

11 13 25 

П 15 35 

10 15 25 

11 16 25 


7. Exercises.

1. The data of Table 17-7 are for samples of eggs of grade A and B. They are
taken from a study made by A. Johnston, Poultry Division, Central Experimental
Farm, Ottawa, on the value of 4 egg characters in determining grade. Each set of
values is made up of means corresponding to one farm sampled for egg quality. This
table gives the basic data required for the calculation of the discriminant function.
Calculate $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ as in Example 17-1 and set up the actual equation corre-
sponding to equation (7).

Make an analysis of variance and determine the significance of the discriminant
function.

Calculate Z for 4 samples from grade A and 4 from grade B.
CHAPTER 18


Probit Analysis 


1. Types of data appropriate for probit analysis. A great deal of 
biological work is now concerned with the assay of drugs, vitamins, sera, 
and other forms of stimuli. In these experiments the effect of the stimulus 
is determined according to the reaction of living organisms. Usually the 
stimulus is applied at a series of levels and the reaction of each level 
determined from its application to a batch of subjects. 

Instead of speaking in general terms it is more convenient, in describing
the kind of data obtained, to refer to one type of experiment such as the
determination of the toxicity of a chemical preparation to a given type of
insect. In such an experiment various concentrations are prepared and
a batch of insects assigned at random to each concentration level. The
substance is applied, and for each batch a count is made of the total
number of insects ($n$) and the number killed ($r$). The results can be
expressed either as a proportion ($r/n$) or as a percentage $100(r/n)$. This
is the type of data to which the methods of probit analysis are appro-
priate in assessing the value of different toxic substances, comparing
relative potencies of different substances, applying tests of significance,
and determining fiducial limits.

2. The distribution of tolerances. For any one subject there is a level 
of intensity of the stimulus below which response does not occur and 
above which it does occur. This level is referred to as the tolerance for 
that subject. For example, in considering a given insect there is a level 
of concentration for a certain poison such that if the concentration is less 
than this level the insect will live, but if the concentration is greater the 
insect will die.* Let this tolerance of a subject be represented by $\lambda$;
then in a population of subjects we are concerned with the distribution
of $\lambda$. It is generally true for most biological preparations that the
distribution of $\lambda$ is not normal but is at least approximately normal for
$X = \log_{10} \lambda$. In the usual analysis, therefore, we deal with the log of
the concentration. Following Finney [5], $\lambda$ is referred to as the dose in
terms of actual concentration such as milligrams per liter, and the $\log_{10}$
concentration is referred to as the dosage or dose metameter.

* It should be noted that this is the ideal situation from the standpoint of a
mathematical model but one that is not strictly true on account of variability in the responses
of individual insects.

It should be noted that the log transformation does not necessarily 
normalize the distribution of tolerances. In certain cases other trans- 
formations are more appropriate, but the log transformation is found to 
apply in most experiments and there are chemical and biological reasons 
why this should be true. 

3. The probit transformation. Assuming that the dosage in a popula- 
tion is normally distributed, we can picture the situation as in Figure 18-1. 
In the upper portion of the figure we have a normal curve in which the 
dosage deviations from the mean $(X - \bar{X})$ are replaced on the base line
by $(X - \bar{X})/\sigma$ or the normal equivalent deviate (N.E.D.). In other words
this can be regarded as a normal distribution with unit standard deviation.
Above the graph of the normal distribution are the percentages of the
total area from a given point on the base line to N.E.D. $= -\infty$. For
example, at N.E.D. $= -1$, the area to the left of that point is 15.87% of
the total. These percentages are represented in the lower half of the 
figure by the sigmoid curve. 

It should be clear from the figure that, if we have a population in 
which the tolerance for a toxic substance is normally distributed and we 
apply the substance at a series of levels to different batches, the percentages 
killed when plotted against dosage should give a sigmoid curve. The 
analysis of results from the sigmoid curve presents some rather serious
difficulties; therefore it would seem to be desirable to use a transformation 
of the percentages such that with a normal distribution the transformed 
percentages would lie on a straight line. The obvious transformation for 
this purpose is the corresponding N.E.D. Thus, on obtaining a percentage
kill at a given level of 15.87%, we would locate in a table the N.E.D.
such that if the distribution is normal a kill of exactly 15.87% would be
expected. The value would be $-1$. For a kill of 2.27% the N.E.D.
would be $-2$, and so forth.

The suggestion that the N.E.D. be used as a transformation for per- 
centage response seems to have been first made by Fechner [4] in 1860, 
but it was not considered seriously until it was again made by Gaddum [7] 
in 1933. Later, in 1934, Bliss [1, 2] suggested adding 5 to the N.E.D.
in order to remove negative numbers, and he also suggested the term 
probit. At that time and since then Dr. Bliss has been responsible for a 
great deal of the development work with the probit transformation. At 
the present the best summary of the theory and application of probit 
analysis is Probit Analysis by D. J. Finney [5]. 
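
The transformation can also be checked numerically instead of from printed tables. The following is a minimal sketch in Python, assuming only the standard library; the function names `ned` and `probit` are illustrative and do not come from the text.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, unit standard deviation).
_STD_NORMAL = NormalDist()


def ned(percent_kill: float) -> float:
    """Normal equivalent deviate for a percentage strictly between 0 and 100."""
    return _STD_NORMAL.inv_cdf(percent_kill / 100.0)


def probit(percent_kill: float) -> float:
    """Probit as described above: the N.E.D. with 5 added to remove negatives."""
    return ned(percent_kill) + 5.0


# Percentages quoted in the text for N.E.D. = -2, -1, 0 and +1.
for pct in (2.27, 15.87, 50.0, 84.13):
    print(f"{pct:6.2f}%  N.E.D. = {ned(pct):+.2f}  probit = {probit(pct):.2f}")
```

Percentages of exactly 0 or 100 have no finite N.E.D., so the sketch would raise an error for them; the text treats such points separately.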
In Figure 18-1 the probit scale is shown on the right of the lower part
of the figure and the straight line represents the probits of the percentages
at all points along the normal curve.

[Figure 18-1. Percentages of total area marked above the normal curve:
0.13, 2.27, 15.87, 50.00, 84.13, 97.73, 99.87 at N.E.D. = -3, -2, -1, 0, +1, +2, +3
(0 and 100.00 at the extremes), and 0.02, 0.62, 6.68, 30.85, 69.15, 93.32, 99.38,
99.98 at the intermediate half units from -3.5 to +3.5.]

FIGURE 18-1. Illustrating theoretical distribution of tolerances, and
relation of percentage kill to probits.


It should now be clear that, in a typical experiment wherein the 
tolerances are normally distributed, a graph of the probits corresponding 
to determined percentages will tend to be a straight line. Any variation 
from the normal curve will cause the plotted probits to vary from a 
straight line. Generally, the observed variations from a straight line are 
of two types. In the first place the batches of subjects may not be all
uniform or the conditions for the batches may not be uniform. This 
will tend to produce an abnormal scatter of the points about the straight 
line. In the second place the transformation of the dose to dosage may 
not be suitable. Usually this will be indicated by trends towards curvi- 
linear rather than linear regression. 

4. Practical applications of probit analysis. Since we expect to get a 
straight-line graph when probits are plotted against dosage, the methods 
of linear regression are suggested. What we have to decide, however, is 
whether or not such methods are applicable: if they will yield estimates 
of required population parameters, if valid tests of significance can be 
applied, and if fiducial limits can be determined. 

First we require a measure of the potency of the preparation. It has 
been concluded generally that the dosage giving a 50% kill is the most 
valuable statistic. It is referred to as the LD 50 (median lethal dose).
In experiments where the response is not death we refer to the ED 50 
(median effective dose). Whatever practical advantages there may be in 
knowing the LD 90 or some similar value, the fact is that much greater 
precision can be obtained in the measurement of the LD 50. This is 
obvious when we consider the measurement of a very high dosage such 
as the LD 100. Any levels administered beyond this point would give 
no information whatever. In the case of the LD 50, levels above and 
below contribute equally to the result. 

Another factor to be measured is the range of the dosage required for 
a given range of percentage kill. This might be referred to as the sensi- 
tivity of the preparation tested. Obviously if small changes in concen- 
trations give a wide range in the percentage kill, the sensitivity is high. 

Referring now to a probit graph such as Figure 18-1, which will be 
described in detail later, it is clear that the LD 50 with respect to dosage is 
given by the dosage that corresponds to a probit value of 5.0. Therefore, 
if a straight line can be fitted, the LD 50 can be read directly from the graph.

Again on referring to the graph it is evident that sensitivity will be 
represented by the slope of the line. The greater the slope, the narrower
the range in dosage for a given range in the percentage kill. 

The geometry of the line would seem to give us therefore the required
measures of potency and sensitivity. Locating 2 points $X_1$ and $X_2$,
representing dosage, on the abscissa of the graph and finding the corresponding
points $Y_1$ and $Y_2$ on the probit scale will give the slope of the
line. If b represents the slope, we have

$$b = \frac{Y_2 - Y_1}{X_2 - X_1} \tag{1}$$

This makes it possible to set up a regression equation of the type
Y = a + bX, where $a = Y_1 - bX_1$ or $Y_2 - bX_2$. This line can also be
used to locate the LD 50 which we now represent by the symbol m.
Putting Y = 5 and m for X, the regression equation is 5 = a + bm, which
can be solved for m.
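
The geometry just described can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the two points used are those read from the eye-fitted line in Example 18-1 later in this chapter.

```python
def line_through(x1: float, y1: float, x2: float, y2: float) -> tuple[float, float]:
    """Return (a, b) for the line Y = a + bX through two points on the probit graph."""
    b = (y2 - y1) / (x2 - x1)   # slope, equation (1)
    a = y1 - b * x1             # intercept
    return a, b


def dosage_at_probit(a: float, b: float, probit: float = 5.0) -> float:
    """Solve probit = a + b * X for X; a probit of 5 gives m, the LD 50."""
    return (probit - a) / b


# Points read from the provisional line of Example 18-1.
a, b = line_through(1.0, 4.02, 2.0, 6.66)
m = dosage_at_probit(a, b)            # LD 50 in dosage (log) units
ld90 = dosage_at_probit(a, b, 6.28)   # a probit of 6.28 corresponds to a 90% kill
print(f"a = {a:.2f}, b = {b:.2f}, LD 50 = {m:.2f}, LD 90 = {ld90:.2f}")
```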

5. Fitting a probit regression line. The fitting of a suitable straight-line 
regression would seem from the argument above to provide estimates of 
the required parameters. On making a close study of the problem of 
fitting such a line, however, we find that there is an essential difference 
between fitting a line to probit values and fitting a regression line to 
measure the relation between 2 variables in the ordinary case. It will be 
recalled from Chapter 6 that an essential assumption in the fitting of a 
regression line and testing its goodness of fit is that the variance of Y, 
the dependent variable, at all levels of X, the independent variable, should 
be the same. For the probit regression line this is definitely not the case. 
Actually, the variance is a minimum at the LD 50 and goes to infinity, 
on one end at the 100% kill, and on the other end at the 0% kill. In 
order to fit a regression line accurately it is necessary to weight the values 
at each point by the inverse of the variance. It has been shown that, if 
P represents the probability of kill at a given dosage level and Q = (1 - P),
the probability of survival, the correct weighting coefficient is $w = Z^2/PQ$,
where Z is the ordinate of the normal distribution corresponding to the
probability P. Taking Y as a probit value on the straight line, the
values of $Z^2/PQ$ have been tabulated by Bliss [2] and reproduced by
Fisher and Yates [6]. To illustrate the use of the weighting coefficients
we must calculate the mean $\bar{X}$ from the formula $\Sigma(nwX)/\Sigma(nw)$ in place
of the usual formula $\Sigma(nX)/\Sigma(n)$, where n represents the total number of
insects in a batch.
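
Where the printed tables are not at hand, the weighting coefficient can be computed directly from its definition. The sketch below assumes Python's standard library; the function names are illustrative, and the batch sizes and probits are those used in Table 18-3 of this chapter.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def weighting_coefficient(Y: float) -> float:
    """w = Z^2 / (P * Q) for a probit Y taken from the straight line."""
    deviate = Y - 5.0                # back from probit to N.E.D.
    Z = _STD_NORMAL.pdf(deviate)     # ordinate of the normal curve
    P = _STD_NORMAL.cdf(deviate)     # expected proportion killed
    Q = 1.0 - P                      # expected proportion surviving
    return Z * Z / (P * Q)


def weighted_mean(X: list[float], n: list[int], Y: list[float]) -> float:
    """Weighted mean dosage, sum(nwX) / sum(nw)."""
    nw = [ni * weighting_coefficient(Yi) for ni, Yi in zip(n, Y)]
    return sum(a * x for a, x in zip(nw, X)) / sum(nw)


# Dosages, batch sizes and one-decimal probits as entered in Table 18-3.
X = [1.00, 1.48, 1.70, 1.85, 1.98]
n = [137, 152, 146, 154, 152]
Y = [4.0, 5.3, 5.9, 6.3, 6.6]
for Yi in Y:
    print(f"Y = {Yi:.1f}  w = {weighting_coefficient(Yi):.3f}")
print(f"weighted mean X = {weighted_mean(X, n, Y):.3f}")
```

The weights fall away quickly on either side of a probit of 5, which is why points near the 50% kill dominate the fit.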

One difficulty that arises in applying the weighting coefficients is that 
they are based on Z, P, and Q, which are parameters of the population 
and not estimates from the data of the experiment. In other words we 
must theoretically know the equation for the straight line in order to 
determine the values of w, which of course is an impossibility. In actual 
practice two things can be done to overcome the difficulty. First, many 
examples are such that a sufficiently good straight line can be fitted by 
obtaining what is known as an eye-fit. A straight edge is laid along the 
graph of probit points, and, if it is obvious that a close fit to the points 
can be obtained, this line is drawn and used for obtaining the required 
measures of potency and sensitivity, and further refinement is unnecessary. 
Second, a line fitted by eye (provisional line) will provide first approxima- 
tions of Y and consequently of w. Substituting the approximate values 
of Y and w, another line is fitted which is a better fit than the provisional 
line. If further refinement is required, a third line can be fitted using the
second line to provide estimates of Y and w. This process, if repeated, 
will lead very quickly to a line that does not change on further applications 
of the fitting process. This is known as an iterative process, and by 
suitable methods of fitting it can be made to give the maximum likelihood 
solution. It may appear to be tedious, but in actual practice there is 
little need to go beyond the second line. 
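
The iterative process can be sketched as a short program. This is an illustration under stated assumptions rather than the author's worked procedure: it uses the working probit that is introduced in Example 18-2, computes weights from the normal distribution instead of reading them from tables, and takes the data of Table 18-1 with the eye-fitted line of Example 18-1 as the provisional line. The names `refit` and `fit` are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

# Data of Table 18-1: dosage X, insects per batch n, number killed r.
X = [1.00, 1.48, 1.70, 1.85, 1.98]
n = [137, 152, 146, 154, 152]
r = [23, 95, 119, 141, 144]


def refit(X, n, r, a, b):
    """One cycle: from the provisional line Y = a + bX return an improved (a, b)."""
    s_w = s_wx = s_wy = s_wxx = s_wxy = 0.0
    for Xi, ni, ri in zip(X, n, r):
        Y = a + b * Xi                      # provisional probit
        Z = _STD_NORMAL.pdf(Y - 5.0)
        P = _STD_NORMAL.cdf(Y - 5.0)
        nw = ni * Z * Z / (P * (1.0 - P))   # batch size times weighting coefficient
        y = Y + (ri / ni - P) / Z           # working probit
        s_w += nw
        s_wx += nw * Xi
        s_wy += nw * y
        s_wxx += nw * Xi * Xi
        s_wxy += nw * Xi * y
    # Weighted linear regression of working probit on dosage.
    b_new = (s_wxy - s_wx * s_wy / s_w) / (s_wxx - s_wx ** 2 / s_w)
    a_new = s_wy / s_w - b_new * s_wx / s_w
    return a_new, b_new


def fit(X, n, r, a, b, tol=1e-6, max_cycles=100):
    """Repeat the cycle until the line no longer changes."""
    for cycle in range(1, max_cycles + 1):
        a_new, b_new = refit(X, n, r, a, b)
        if abs(a_new - a) < tol and abs(b_new - b) < tol:
            return a_new, b_new, cycle
        a, b = a_new, b_new
    raise RuntimeError("line did not settle within max_cycles")


a, b, cycles = fit(X, n, r, a=1.38, b=2.64)
m = (5.0 - a) / b
print(f"Y = {a:.3f} + {b:.3f} X after {cycles} cycles; LD 50 (dosage) = {m:.3f}")
```

In line with the remark above, the first one or two cycles account for nearly all of the change; the remaining cycles only tidy the later decimal places.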

The details of fitting and the formulas required are best observed from 
the actual examples given below. The same is true for learning the 
meaning of the terms and the interpretation of the data. 


FIGURE 18-2. Line fitted by eye to data of Table 18-1. (Ordinate: probit;
abscissa: $X = \log_{10}(\text{concentration} \times 100)$, from 1.0 to 2.0.)


6. Example 18-1. Fitting a provisional probit regression line. Table
18-1 is taken from data given by Morrison [10] for the effect of different
concentrations of nicotine sulphate in a 1% saponin solution, on Droso- 
phila melanogaster. The various steps in the calculations are best carried 
out as enumerated below. 

1. Set up the data as in the first 3 columns of Table 18-1 and calculate 
% kill as in column (4). 

2. Determine log concentrations after multiplying the concentrations 
by a suitable factor to remove negative values. These are entered in 
column (5). 

3. Look up empirical probits in tables by Fisher and Yates, entering 
the table under % kill. Record these in column (6). 

4. Draw Figure 18-2. First represent the empirical probits by dots, 
and then draw the straight line to give a good eye-fit. In this example it 
is fairly easy to draw a suitable line. One point that should be kept in 
mind when it is difficult to decide where the line should be placed is that
the most attention should be paid to those points representing kills of 
40 to 60%, and percentage kills outside of the limits of 16 to 84 should 
be practically disregarded.* 

5. From the regression line determine:

a. The LD 50 and LD 90. The LD 50 corresponds to a probit of 5.0,
and the LD 90 to a probit of 6.28. In this figure we get m = LD 50
= 1.37 and LD 90 = 1.86.

b. The regression coefficient. Locate 2 convenient points at the ends
of the line and write down corresponding values of X and Y. Thus

$$X_1 = 1.0 \qquad Y_1 = 4.02$$
$$X_2 = 2.0 \qquad Y_2 = 6.66$$

Then the increase in Y for unit increase in X is

$$\frac{Y_2 - Y_1}{X_2 - X_1} = 2.64 = b.$$


6. In the regression equation Y = a + bX, put $a = Y_1 - bX_1 = 1.38$
and b = 2.64, giving the equation

$$Y = 1.38 + 2.64X$$


and from this equation obtain values of Y falling on the straight line 
for the levels of X required. The X’s are entered in column (1) of Table 
18-2, and the Y's in column (2). If the figure is accurately drawn, the
Y values can be read from the figure instead of calculating them from 
the equation. 

7. Reading backwards from Fisher and Yates’ table of probits, find the 
corresponding values of P which are entered in column (3). Then 
complete the entries in columns (4) and (5), copying from Table 18-1. 

8. Calculate the entries in columns (6), (7), and (8) as indicated. Note 
that P is a proportion and not a percentage. Thus a percentage of 16.4 
is entered as 0.164. Add the last column to obtain $\chi^2 = 0.769$. This
$\chi^2$ is based on 3 degrees of freedom since we have adjusted for m and b.
From Table A-4 we note that a $\chi^2$ of 0.769 for 3 degrees of freedom
corresponds to a probability of about 0.85. This is a better fit than can 
ordinarily be expected but is not sufficiently close to lead us to be 
suspicious of the data. 

9. Tests of significance can now be applied to b and m, and fiducial
limits calculated. The required preliminary calculations are shown in 


* In Figure 18-2 it is possible that the line should have been drawn with more slope, 
giving greater weight to the percentage kill of 62.5 and 81.5. 
Table 18-3. Note that Y is entered to the first decimal place only. The
values of Y can be read to the second decimal place from the regression
graph, but it can be shown that such a degree of accuracy is unnecessary
at this point. The fourth column contains the weighting coefficients that
are determined from Fisher and Yates’ Table XI.

We require the sum of squares of X given by

$$\Sigma(nwx^2) = \Sigma(nwX^2) - \frac{[\Sigma(nwX)]^2}{\Sigma(nw)} = 31.5319 \tag{2}$$

Then the variance of b is

$$V_b = \frac{1}{\Sigma(nwx^2)} = \frac{1}{31.5319} = 0.031\,714 \tag{3}$$

$$s_b = \sqrt{0.031\,714} = 0.178 \tag{4}$$

and the fiducial limits are given by

$$b \pm t s_b$$

where t = 1.96. If $\chi^2$ had indicated significant heterogeneity, it would
have been necessary to make a correction in obtaining the fiducial limits.
This will be demonstrated after first making the usual calculation of the
limits.

$$2.64 \pm 1.96 \times 0.178 = 2.64 \pm 0.35$$

giving 2.29 to 2.99 as the required limits.

If heterogeneity is present, we calculate the heterogeneity factor

$$u = \frac{\chi^2}{DF} \tag{5}$$

where the degrees of freedom = k - 2, k being the number of batches
tested. Then the variance of b is

$$V'_b = \frac{u}{\Sigma(nwx^2)} \tag{6}$$

and

$$s'_b = \sqrt{V'_b}$$

The fiducial limits are then

$$b \pm t s'_b$$

where t is the value required for significance at the 5% level for k - 2
degrees of freedom.
To get the fiducial limits of m we calculate

$$V_m = \frac{1}{b^2}\left[\frac{1}{\Sigma(nw)} + \frac{(m - \bar{X})^2}{\Sigma(nwx^2)}\right] \tag{7}$$

$$= \frac{1}{2.64^2}\left[\frac{1}{310.46} + \frac{(1.37 - 1.56)^2}{31.5319}\right] = 0.000\,626\,4$$

$$s_m = \sqrt{0.000\,626\,4} = 0.025$$

The fiducial limits are then

$$m \pm t s_m \tag{8}$$

where t = 1.96 if there is no significant evidence of heterogeneity. Here
we have

$$1.37 \pm 1.96 \times 0.025 = 1.37 \pm 0.05$$

giving 1.42 to 1.32 as the required limits.

The value of m and its fiducial limits should be expressed in actual
concentrations, so we find their antilogs and divide by 100 to get back to
the original scale. We get

LD 50 = 0.234

with fiducial limits 0.263 and 0.209.

A similar calculation gives the fiducial limits of the LD 90.

$$V_{LD\,90} = \frac{1}{2.64^2}\left[\frac{1}{310.46} + \frac{(1.86 - 1.56)^2}{31.5319}\right] = 0.000\,87, \qquad s_{LD\,90} = 0.03$$

Then

$$1.86 \pm 1.96 \times 0.03 = 1.86 \pm 0.06$$

giving 1.92 to 1.80 as the fiducial limits. The corresponding doses are

LD 90 = 0.724

with fiducial limits 0.832 and 0.631.

Two points should be noted here in connection with fiducial limits. 
In the first place, formula (7) is merely a close approximation; and in 
the second place a correction for heterogeneity must be made whenever 
$\chi^2$ is significant. The approximate formula is sufficiently accurate for
this example, and also there is no evidence of heterogeneity, so further 
corrections will not be made. The exact formula for fiducial limits and the 
use of the heterogeneity factor will be demonstrated in the next example. 
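
As a check on the arithmetic of step 9, the approximate limits of formula (7) and their conversion back to concentrations can be sketched as follows. The function names are hypothetical; the inputs are the values quoted in this example, with the weighted mean dosage taken as 482.98/310.46 = 1.56 from Table 18-3.

```python
import math


def approximate_limits(m, b, x_bar, sum_nw, sum_nwx2, t=1.96):
    """Approximate fiducial limits of m: m +/- t * s_m, with V_m from formula (7)."""
    var_m = (1.0 / sum_nw + (m - x_bar) ** 2 / sum_nwx2) / b ** 2
    half_width = t * math.sqrt(var_m)
    return m - half_width, m + half_width


def to_concentration(dosage, scale=100.0):
    """Antilog of the dosage, divided by the factor applied before taking logs."""
    return 10.0 ** dosage / scale


lower, upper = approximate_limits(m=1.37, b=2.64, x_bar=1.56,
                                  sum_nw=310.46, sum_nwx2=31.5319)
print(f"dosage limits: {lower:.2f} to {upper:.2f}")
print(f"LD 50 = {to_concentration(1.37):.3f}, "
      f"limits {to_concentration(lower):.3f} to {to_concentration(upper):.3f}")
```

Because the sketch does not round the dosage limits to two decimals before taking antilogs, the concentration limits may differ from the printed ones in the third decimal place.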


TABLE 18-1
DATA FROM AN EXPERIMENT ON THE EFFECT OF DIFFERENT CONCENTRATIONS OF
NICOTINE SULPHATE ON Drosophila melanogaster. DETERMINATION OF
EMPIRICAL PROBITS

    (1)           (2)          (3)        (4)          (5)            (6)
  Nicotine
  Sulphate,     Number       Number                   Log10         Empirical
  gm/100 cc   of Insects     Killed     % Kill    (Conc. x 100)      Probit
                  n            r          P            X               Y
    0.10         137           23        16.8         1.00            4.04
    0.30         152           95        62.5         1.48            5.32
    0.50         146          119        81.5         1.70            5.90
    0.70         154          141        91.6         1.85            6.38
    0.95         152          144        94.7         1.98            6.62
TABLE 18-2
CALCULATION OF χ², DATA OF TABLE 18-1

   (1)     (2)      (3)      (4)    (5)     (6)       (7)            (8)
    X       Y      P, %       n      r      nP      r - nP    (r - nP)²/nP(1 - P)
   1.00    4.02    16.4      137     23     22.5    + 0.5           0.013
   1.48    5.29    61.4      152     95     93.3    + 1.7           0.080
   1.70    5.87    80.8      146    119    118.0    + 1.0           0.044
   1.85    6.26    89.6      154    141    138.0    + 3.0           0.627
   1.98    6.61    94.6      152    144    143.8    + 0.2           0.005
                                                             χ² =   0.769
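
The last column of Table 18-2 can be reproduced with a short calculation. The sketch below is illustrative only (the function name is not from the text); it carries unrounded differences, so its total differs slightly from the 0.769 obtained above with differences rounded to one decimal.

```python
def chi_square(n, r, P):
    """Sum over batches of (r - nP)^2 / (nP(1 - P)), with P as a proportion."""
    return sum((ri - ni * Pi) ** 2 / (ni * Pi * (1.0 - Pi))
               for ni, ri, Pi in zip(n, r, P))


# Columns (3), (4) and (5) of Table 18-2, with P converted from % to proportions.
n = [137, 152, 146, 154, 152]
r = [23, 95, 119, 141, 144]
P = [0.164, 0.614, 0.808, 0.896, 0.946]

chi2 = chi_square(n, r, P)
degrees_of_freedom = len(n) - 2   # two constants, m and b, were fitted
print(f"chi-square = {chi2:.3f} on {degrees_of_freedom} degrees of freedom")
```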
TABLE 18-3
CALCULATIONS, DATA OF TABLE 18-1

     X        n        Y        w         nw        nwX
   1.00      137      4.0     0.439      60.14      60.14
   1.48      152      5.3     0.616      93.63     138.57
   1.70      146      5.9     0.471      68.77     116.91
   1.85      154      6.3     0.336      51.74      95.72
   1.98      152      6.6     0.238      36.18      71.64

   Total                                310.46     482.98

   Σ(nwX²) = 782.8998
   C.T.    = 751.3679

   Σ(nwx²) =  31.5319
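
The totals of Table 18-3 and the limits of b that follow from them can be verified with a few lines of code. This is a sketch with a hypothetical function name; since the products nw are not rounded to two decimals here, the corrected sum of squares agrees with 31.5319 only to about two decimal places.

```python
import math


def slope_limits(X, n, w, b, t=1.96):
    """Weighted sums of Table 18-3, then s_b and the fiducial limits b +/- t * s_b."""
    nw = [ni * wi for ni, wi in zip(n, w)]
    sum_nw = sum(nw)
    sum_nwX = sum(a * x for a, x in zip(nw, X))
    sum_nwX2 = sum(a * x * x for a, x in zip(nw, X))
    sum_nwx2 = sum_nwX2 - sum_nwX ** 2 / sum_nw   # equation (2)
    s_b = math.sqrt(1.0 / sum_nwx2)               # equations (3) and (4)
    return sum_nw, sum_nwX, sum_nwx2, s_b, (b - t * s_b, b + t * s_b)


X = [1.00, 1.48, 1.70, 1.85, 1.98]
n = [137, 152, 146, 154, 152]
w = [0.439, 0.616, 0.471, 0.336, 0.238]

sum_nw, sum_nwX, sum_nwx2, s_b, limits = slope_limits(X, n, w, b=2.64)
print(f"sum(nw) = {sum_nw:.2f}, sum(nwX) = {sum_nwX:.2f}, sum(nwx^2) = {sum_nwx2:.2f}")
print(f"s_b = {s_b:.3f}, limits of b: {limits[0]:.2f} to {limits[1]:.2f}")
```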
7. Example 18-2. Fitting a probit regression line by the method of
maximum likelihood. The data of columns (1) to (5) of Table 18-4 were 
obtained from Dr. W. S. McLeod, being part of a published study [9] on 
refinements in the technique of testing insecticides. They represent 
percentage kill in batches of Drosophila treated with nicotine sulphate. 
The calculations for fitting a probit regression line by the method of 
maximum likelihood are enumerated below. It is assumed that the 
student is familiar with the methods of Example 18-1. 

1. Enter data as in columns (1) to (5) of Table 18-4. Obtain empirical
probits from the tables by Fisher and Yates, and enter in column (6). 

2. Draw Figure 18-3, showing empirical probits plotted against X.
Set up a straight line to give as good an eye-fit as possible. Then read
off $Y_0$ on the probit scale to one decimal place and enter in column (7).

3. Obtain weighting coefficients from Fisher and Yates’ Table XI or 
from Finney's Table II. 

4. Calculate and enter nw in column (9). 

5. Calculate working probits from Fisher and Yates’ Table XI and 
enter in column (10). (Finney’s Table IV is much more rapid than 
Fisher and Yates’ Table XI and if available should be used.) The 
equation for the working probit is

$$Y_1 = \left(Y_0 - \frac{P}{Z}\right) + \frac{p}{Z} \tag{9}$$

where P and Z correspond to the provisional probit $Y_0$.
Therefore for $Y_0 = 4.2$ and p = 0.233

$$Y_1 = 3.4687 + 0.233 \times 3.4519 = 4.27$$

It will be noted that the working probits in this example are identical with
the empirical probits. This is not always the case, although it is to be
expected that they will be closer to the empirical probits than to $Y_0$.

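6. Calculate the entries for columns (11) and (12), and sum columns
(9), (11), and (12). From these calculate

$$\frac{1}{\Sigma(nw)}, \qquad \bar{X} = \frac{\Sigma(nwX)}{\Sigma(nw)}, \qquad \bar{Y}_1 = \frac{\Sigma(nwY_1)}{\Sigma(nw)}$$

7. Calculate

$$\Sigma(nwx^2) = \Sigma(nwX^2) - \frac{[\Sigma(nwX)]^2}{\Sigma(nw)} \tag{10}$$

$$\Sigma(nwxy_1) = \Sigma(nwXY_1) - \frac{\Sigma(nwX)\,\Sigma(nwY_1)}{\Sigma(nw)} \tag{11}$$

$$\Sigma(nwy_1^2) = \Sigma(nwY_1^2) - \frac{[\Sigma(nwY_1)]^2}{\Sigma(nw)} \tag{12}$$

$$\chi^2 = \Sigma(nwy_1^2) - \frac{[\Sigma(nwxy_1)]^2}{\Sigma(nwx^2)} \tag{13}$$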
In this example $\chi^2 = 24.91$, and on referring to Table A-4 we note that
this is a significant value for 4 degrees of freedom. Heterogeneity is
obvious.

8. Compute

$$b = \frac{\Sigma(nwxy_1)}{\Sigma(nwx^2)} = \frac{312.86}{118.31} = 2.644 \tag{14}$$

and if heterogeneity were absent we would have

$$V_b = \frac{1}{\Sigma(nwx^2)} = \frac{1}{118.31} = 0.008\,452$$

$$s_b = \sqrt{0.008\,452} = 0.0919$$

but, heterogeneity being significant, we have

$$u = \frac{\chi^2}{DF} = \frac{24.91}{4} = 6.228$$

Therefore the corrected variance is

$$0.008\,452 \times 6.228 = 0.052\,639$$

and

$$s'_b = \sqrt{0.052\,639} = 0.2294$$

Since t = 2.78 at the 5% point for 4 degrees of freedom, the fiducial
limits of b are obtained from

$$2.644 \pm 2.78 \times 0.2294 = 2.644 \pm 0.638$$

giving limits of 2.01 and 3.28.
9. From the regression equation

$$Y = \bar{Y}_1 + b(m - \bar{X}) \tag{15}$$

putting Y = 5, we get

$$m = \frac{5 - \bar{Y}_1}{b} + \bar{X} = \frac{5 - 5.0661}{2.644} + 1.0273 = 1.0023$$

The dose in original units of concentration expected to give a 50% kill
is therefore (antilog 1.0023)/10 = 1.005.

Since $\chi^2$ indicates significant heterogeneity, it is necessary to take this
into account in determining the fiducial limits. Also in this example we
shall apply the formula for the more exact fiducial limits.
First we have $u$, as calculated in step 8, which must be introduced in
order to take care of heterogeneity. Also we must obtain

$$g = \frac{t^2 u}{b^2\,\Sigma(nwx^2)} = \frac{2.776^2 \times 6.228}{2.644^2 \times 118.31} = 0.0580 \tag{16}$$

Then the exact fiducial limits are given by

$$m + \frac{g}{1-g}(m - \bar{X}) \pm \frac{t}{b(1-g)}\sqrt{u\left[\frac{1-g}{\Sigma(nw)} + \frac{(m-\bar{X})^2}{\Sigma(nwx^2)}\right]} \tag{17}$$

which for purposes of calculation is most conveniently broken down into
three parts.

$$m + \frac{g}{1-g}(m - \bar{X}) = 1.0023 + \frac{0.0580}{0.9420}(1.0023 - 1.0273) = 1.0008$$

$$\frac{t}{b(1-g)} = \frac{2.776}{2.644 \times 0.9420} = 1.1146$$

$$\sqrt{u\left[\frac{1-g}{\Sigma(nw)} + \frac{(m-\bar{X})^2}{\Sigma(nwx^2)}\right]} = \sqrt{6.228\left[\frac{0.9420}{3827} + \frac{(1.0023 - 1.0273)^2}{118.31}\right]} = 0.039\,53$$

Finally

$$1.0008 \pm 1.1146 \times 0.039\,53 = 1.0008 \pm 0.0441$$

giving fiducial limits of 1.045 to 0.957 in dosage terms and 1.11 to 0.91
in actual concentrations.

The procedure and methods for calculating fiducial limits of m may be
summarized as follows.

a. No heterogeneity ($\chi^2$ not significant).

(i) g small with respect to unity, where

$$g = \frac{t^2}{b^2\,\Sigma(nwx^2)} \qquad \text{(taking } t = 1.96\text{)}$$

Limits given by

$$m \pm t s_m$$

(ii) g appreciable with respect to unity.

Limits given by

$$m + \frac{g}{1-g}(m - \bar{X}) \pm \frac{t}{b(1-g)}\sqrt{\frac{1-g}{\Sigma(nw)} + \frac{(m-\bar{X})^2}{\Sigma(nwx^2)}}$$

b. Significant heterogeneity.

Calculate

$$u = \frac{\chi^2}{DF}, \qquad g = \frac{t^2 u}{b^2\,\Sigma(nwx^2)}$$

Limits given by

$$m + \frac{g}{1-g}(m - \bar{X}) \pm \frac{t}{b(1-g)}\sqrt{u\left[\frac{1-g}{\Sigma(nw)} + \frac{(m-\bar{X})^2}{\Sigma(nwx^2)}\right]}$$
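
The summary can be collected into one routine. The following is a sketch under stated assumptions: the function name and the symbol `u` for the heterogeneity factor are illustrative, and passing `u = 1` is taken to represent the case of no heterogeneity. The figures used are those of Example 18-2.

```python
import math


def exact_fiducial_limits(m, b, x_bar, sum_nw, sum_nwx2, t, u=1.0):
    """Exact fiducial limits of m; u is the heterogeneity factor (1 if none)."""
    g = t * t * u / (b * b * sum_nwx2)
    if g >= 1.0:
        raise ValueError("g >= 1: the slope is not significant, limits undefined")
    centre = m + g / (1.0 - g) * (m - x_bar)
    root = math.sqrt(u * ((1.0 - g) / sum_nw + (m - x_bar) ** 2 / sum_nwx2))
    half_width = t / (b * (1.0 - g)) * root
    return g, centre - half_width, centre + half_width


# Example 18-2: t = 2.776 for 4 degrees of freedom, u = 24.91 / 4 = 6.228.
g, lower, upper = exact_fiducial_limits(m=1.0023, b=2.644, x_bar=1.0273,
                                        sum_nw=3827, sum_nwx2=118.31,
                                        t=2.776, u=6.228)
print(f"g = {g:.4f}; dosage limits {lower:.3f} to {upper:.3f}")
print(f"concentration limits {10 ** lower / 10:.2f} to {10 ** upper / 10:.2f}")
```

When g is small the centre term and the factor (1 - g) have little effect, and the result approaches the approximate limits m ± t s_m.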
At this point the student might refer to Section 10 on Experimental
Design with a view to assessing the design of the experiment just analyzed. 


FIGURE 18-3. Fitting of probit regression line to data
of Table 18-4. (Ordinate: probit kill; abscissa:
$X = \log_{10}(\text{concentration} \times 100)$.)


It will be noted that the dosage range is very good in that the percentage 
kills range from 23 to 75%. Also, the number of insects in each batch 
is quite satisfactory. In this experiment there is no control level and 
hence no measure of natural mortality. This might be considered a 
weakness in the design, but it is of course possible that the experimenter 
may have had natural mortality in good control and considered that a 
measure of it was not required. A significant feature of this experiment 
is the heterogeneity, and an examination of Figure 18-3 indicates a 
curvilinear trend. Possibly some relation between dose and dosage other 
than the logarithmic one tried would be better, and this point at least
requires investigation.
TABLE 18-4

DATA ON EFFECT OF DIFFERENT CONCENTRATIONS OF NICOTINE SULPHATE
ON Drosophila melanogaster, FROM McLEOD [9]

Columns: (1) concentration; (2) X, log dosage; (3) n, number of insects;
(4) r, number killed; (5) p, % kill; (6) empirical probit; (7) Y0, provisional
probit; (8) w, weighting coefficient; (9) nw; (10) Y1, working probit;
(11) nwX; (12) nwY1.

  (1)    (2)    (3)    (4)    (5)    (6)    (7)     (8)    (9)   (10)     (11)        (12)
 Conc.    X      n      r      p    Emp.    Y0       w      nw    Y1      nwX         nwY1
  0.50   0.70   1048   244   23.3   4.27   4.2    0.503    527   4.27    368.90     2,250.29
  0.75   0.88   1202   451   37.5   4.68   4.7    0.616    740   4.68    651.20     3,463.20
  1.10   1.04   1039   503   48.4   4.96   5.1    0.634    659   4.96    685.36     3,268.64
  1.25   1.10   1102   638   57.9   5.20   5.3    0.616    679   5.20    746.90     3,530.80
  1.50   1.18   1034   736   71.2   5.56   5.5    0.581    601   5.56    709.18     3,341.56
  1.75   1.24   1167   880   75.4   5.69   5.7    0.532    621   5.69    770.04     3,533.49

  Total                                                   3827          3931.58    19,387.98

  1/Σ(nw) = 0.000 2613      X̄ = 1.0273      Ȳ1 = 5.0661

  Σ(nwX²) = 4157.33      Σ(nwXY1) = 20,230.65      Σ(nwY1²) = 99,073.76
  C.T.    = 4039.02      C.T.     = 19,917.79      C.T.     = 98,221.52
             118.31                    312.86                   852.24

  Regression = (312.86)²/118.31 = 827.33
  χ² = 852.24 - 827.33 = 24.91
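
Columns (8) to (12) of Table 18-4 and the quantities derived from them can be recomputed from the raw counts and the provisional probits. This is a sketch only: the function name is hypothetical, weights and working probits are computed from the normal distribution and not read from tables, and nothing is rounded along the way, so b, χ² and m agree with the printed values only approximately.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

# Columns (2), (3), (4) and (7) of Table 18-4.
X = [0.70, 0.88, 1.04, 1.10, 1.18, 1.24]
n = [1048, 1202, 1039, 1102, 1034, 1167]
r = [244, 451, 503, 638, 736, 880]
Y0 = [4.2, 4.7, 5.1, 5.3, 5.5, 5.7]


def one_cycle(X, n, r, Y0):
    """Weights, working probits, then b, chi-square and m from equations (10)-(15)."""
    nw, Y1 = [], []
    for ni, ri, y0 in zip(n, r, Y0):
        Z = _STD_NORMAL.pdf(y0 - 5.0)
        P = _STD_NORMAL.cdf(y0 - 5.0)
        nw.append(ni * Z * Z / (P * (1.0 - P)))   # column (9)
        Y1.append(y0 + (ri / ni - P) / Z)         # column (10), working probit
    total = sum(nw)
    x_bar = sum(a * x for a, x in zip(nw, X)) / total
    y_bar = sum(a * y for a, y in zip(nw, Y1)) / total
    s_xx = sum(a * (x - x_bar) ** 2 for a, x in zip(nw, X))
    s_xy = sum(a * (x - x_bar) * (y - y_bar) for a, x, y in zip(nw, X, Y1))
    s_yy = sum(a * (y - y_bar) ** 2 for a, y in zip(nw, Y1))
    b = s_xy / s_xx
    chi2 = s_yy - s_xy ** 2 / s_xx
    m = x_bar + (5.0 - y_bar) / b
    return b, chi2, m


b, chi2, m = one_cycle(X, n, r, Y0)
u = chi2 / (len(X) - 2)   # heterogeneity factor on k - 2 degrees of freedom
print(f"b = {b:.3f}, chi-square = {chi2:.2f}, u = {u:.2f}, m = {m:.4f}")
```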


8. Example 18-3. Determination of relative potency of biological pre- 
parations. In the measurement of the effect of toxic substances, effective- 
ness of sera for the prevention of disease, and so forth, one of the common 
problems is to compare the potency of two or more preparations. In 
some experiments the preparations may be new ones, but the usual 
problem is that of comparing one or more new preparations with a
standard. In biological work of this sort it is very difficult to obtain
exactly similar conditions in experiments carried out at different times and 
with different insects or animals. The inclusion of a standard in an 
experiment is therefore a routine procedure. 

Irwin [8] quotes an example from Smith [12] in which two anti-pneumococcus
sera are tested for the prevention of disease in mice. The
data are given in Table 18-5. The symbols $\alpha$ and $\beta$ distinguish the sera,
and for our purpose $\beta$ can be considered as the standard. Note that the
data are for percentage survival and not for percentage mortality as in
previous examples. The calculations are conveniently carried out in the
steps enumerated below.

1. Set up the data as in the first four columns of Table 18-5 and calculate
p (the percentage survival). Empirical probits are then found from
Fisher and Yates’ Table IX, noting that for p = 0 the empirical probit
is $-\infty$.

2. Draw Figure 18-4, locating the points corresponding to the empirical
probits but neglecting $-\infty$. On this graph draw two parallel lines giving
a good eye-fit. Then for each point on the graph find the corresponding
point on the straight line and from these read off the provisional probits
($Y_0$) on the probit scale. Enter these in column (7). For the empirical
probit $-\infty$, note that the figure is entered at X = 0.64 for $\alpha$ and X = 1.24
for $\beta$.

FIGURE 18-4. Probit regression lines for data of Table 18-5. (Ordinate:
probits; abscissa: $X = \log_{10}(\text{concentration} \times 10{,}000)$.)

3. Enter weighting coefficients in column (8) from Fisher and Yates’ 
Table XI. 

Calculate and enter nw in column (9). 

4. The working probits of column (10) are obtained from Fisher and 
Yates’ Table IX. For $Y_0 = 2.7$, $Y_1 = 2.32$ from the formula for a
minimum working probit; that is, $Y_1 = Y_0 - P/Z$. For $Y_0 = 3.6$ we have

$$Y_1 = \left(Y_0 - \frac{P}{Z}\right) + \frac{p}{Z} = 3.0606 + 0.05 \times 6.6788 = 3.39$$
5. Calculate and enter $nwX$ and $nwY_1$ in columns (11) and (12). Then
sum columns (9), (11), (12) separately for $\alpha$ and $\beta$, and calculate $1/\Sigma(nw)$,
$\bar{X}$, and $\bar{Y}_1$.

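6. Provisional values of b, $m_\alpha$, $m_\beta$, $M_{\alpha\beta}$, and $\rho_{\alpha\beta}$ may now be calculated.

$$M_{\alpha\beta} = m_\beta - m_\alpha \qquad \text{(relative potency in dosage)}$$
$$\rho_{\alpha\beta} = 10^{M_{\alpha\beta}} \qquad \text{(relative potency in units of concentration)}$$

From the graph for $\alpha$

$$X_2 = 1.80 \qquad Y_2 = 6.01$$
$$X_1 = 0.45 \qquad Y_1 = 2.12$$
$$\text{Difference} = 1.35 \qquad 3.89 \qquad b = \frac{3.89}{1.35} = 2.88$$

From the graph for $\beta$

$$X_2 = 2.40 \qquad Y_2 = 5.70$$
$$X_1 = 1.15 \qquad Y_1 = 2.10$$
$$\text{Difference} = 1.25 \qquad 3.60 \qquad b = \frac{3.60}{1.25} = 2.88$$

The two estimates of b must of course agree if the two lines have been
drawn parallel.

Then, setting up the regression equation $Y = \bar{Y}_1 + b(X - \bar{X})$ for both
$\alpha$ and $\beta$ and substituting $m_\alpha$ and $m_\beta$ for X and 5 for Y gives

$$\alpha: \quad Y = 4.96 + 2.88(X - 1.39) = 0.96 + 2.88X$$
$$5 = 0.96 + 2.88\,m_\alpha, \qquad m_\alpha = \frac{4.04}{2.88} = 1.40$$

$$\beta: \quad Y = 4.87 + 2.88(X - 2.08) = -1.12 + 2.88X$$
$$5 = -1.12 + 2.88\,m_\beta, \qquad m_\beta = \frac{6.12}{2.88} = 2.12$$

$$M_{\alpha\beta} = 2.12 - 1.40 = 0.72$$
$$\rho_{\alpha\beta} = 10^{0.72} = 5.2$$

The provisional estimate indicates that 5.2 times the concentration of
$\beta$ is required to give the same protection as 1 unit of concentration of $\alpha$.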
7. Complete the calculations given below in order to obtain new
estimates of the parameters, and to test their significance and the goodness
of fit of the new regression lines.

                       α              β             α + β
  Σ(nwX²)    =     160.1200       322.5672
  C.T.       =     151.4806       315.4643
  Σ(nwx²)    =       8.6394   +     7.1029   =     15.7423

  Σ(nwXY1)   =     566.2667       756.1954
  C.T.       =     540.0853       739.8953
  Σ(nwxy1)   =      26.1814   +    16.3001   =     42.4815

  Σ(nwY1²)   =    2009.2773      1777.4078
  C.T.       =    1925.6069      1735.3626
  Σ(nwy1²)   =      83.6704   +    42.0452   =    125.7156

  Regression =      79.3418        37.4063
  χ²         =       4.3286   +     4.6389   =      8.9675
  DF         =       3              3               6
  5% point   =       7.815          7.815          12.592

A complete analysis of $\chi^2$ can be made by calculating a total $\chi^2$ for
7 degrees of freedom as follows:

$$\chi^2 = 125.7156 - \frac{(42.4815)^2}{15.7423} = 11.0768$$

giving the analysis

                                  χ²        DF       MS      5% Point χ²
  Parallelism of regressions     2.1093      1      2.109        3.84
  Heterogeneity                  8.9675      6      1.494       12.59
  Total                         11.0768      7      1.582       14.07

None of the $\chi^2$ values is significant at the 5% point although there is
some indication of heterogeneity. This seems to be due to a tendency of
the $\alpha$ regression to be somewhat curvilinear and to one point on the $\beta$
regression that deviates rather widely from the straight line. The parallelism
is also only reasonably satisfactory. Note that heterogeneity as
measured here is a residual effect; that is, it represents total lack of
agreement with the straight line within the two regressions.

Due to some indication of heterogeneity the heterogeneity factor will 
be used in setting up fiducial limits although it is not expected that they 
will be greatly affected. 

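8. Obtain new estimates of the parameters and set up fiducial limits.

$$b = \frac{42.4815}{15.7423} = 2.6986$$

$$u = \frac{8.9675}{4} = 2.2419$$

$$V_b = \frac{u}{\Sigma(nwx^2)} = \frac{2.2419}{15.7423} = 0.1424$$

$$s_b = \sqrt{0.1424} = 0.377$$

Fiducial limits are $2.699 \pm 2.776 \times 0.377$, giving 3.75 and 1.65. Then
from

$$Y_\alpha = 4.9597 + 2.6986(X - 1.3911) = 1.2057 + 2.6986X$$

the value of $m_\alpha$ is calculated by putting $m_\alpha = X$ and $Y_\alpha = 5$.

$$m_\alpha = \frac{5 - 1.2057}{2.6986} = 1.4060$$

Similarly

$$Y_\beta = 4.8677 + 2.6986(X - 2.0754) = -0.7330 + 2.6986X$$

$$m_\beta = \frac{5 + 0.7330}{2.6986} = 2.1244$$

Then

$$M_{\alpha\beta} = 2.1244 - 1.4060 = 0.7184$$
$$\rho_{\alpha\beta} = 10^{0.7184} = 5.2$$

The fiducial limits of $M_{\alpha\beta}$ are given by

$$M_{\alpha\beta} + \frac{g}{1-g}(M_{\alpha\beta} - \bar{X}_\beta + \bar{X}_\alpha) \pm \frac{t}{b(1-g)}\sqrt{(1-g)(V_{\bar{Y}_\alpha} + V_{\bar{Y}_\beta}) + (\bar{X}_\beta - \bar{X}_\alpha - M_{\alpha\beta})^2 V_b}$$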
where g and the variances must, in this example, be increased by the
heterogeneity factor. We have to find:

$$g = \frac{2.776^2 \times 2.2419}{2.6986^2 \times 15.7423} = \frac{17.2765}{114.6424} = 0.1507$$

$$V_{\bar{Y}_\alpha} = \frac{2.2419}{78.28} = 0.028\,64$$

$$V_{\bar{Y}_\beta} = \frac{2.2419}{73.24} = 0.030\,61$$

$$V_b = 0.1424 \qquad \text{(step 8, above)}$$

$$\frac{g}{1-g}(M_{\alpha\beta} - \bar{X}_\beta + \bar{X}_\alpha) = \frac{0.1507}{0.8493}(0.7184 - 0.6843) = 0.006\,051$$

$$\frac{t}{b(1-g)} = \frac{2.776}{2.6986 \times 0.8493} = 1.2112$$

$$\sqrt{(1-g)(V_{\bar{Y}_\alpha} + V_{\bar{Y}_\beta}) + (\bar{X}_\beta - \bar{X}_\alpha - M_{\alpha\beta})^2 V_b}$$
$$= \sqrt{(0.8493 \times 0.059\,25) + (0.6843 - 0.7184)^2 \times 0.1424}$$
$$= \sqrt{0.050\,49} = 0.2247$$

Finally the fiducial limits are

$$0.7184 + 0.0061 \pm 1.2112 \times 0.2247 = 0.7245 \pm 0.2722$$

giving 0.9967 and 0.4523. The corresponding relative potencies are
$10^{0.9967} = 9.93$ and $10^{0.4523} = 2.83$.

Owing largely to lack of homogeneity the final result in terms of 
relative potency of the two preparations is not very satisfactory. 
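
The arithmetic of step 8 can be retraced with a short script. The following is only a sketch: the variable names are illustrative and not part of the original text, and the numbers are the rounded values quoted in the example, so the last decimal place may differ slightly from the printed figures.

```python
import math

# Values quoted in the worked example above; variable names are illustrative.
b = 2.6986                        # common slope of the two probit lines
t = 2.776                         # t value used in the text
het = 2.2419                      # heterogeneity factor
sxx = 15.7423                     # divisor used for V_b in step 8
nw_alpha, nw_beta = 78.28, 73.24  # sums of weights for the two series
xbar_alpha, ybar_alpha = 1.3911, 4.9597
xbar_beta, ybar_beta = 2.0754, 4.8677


def log_dose_at_probit_5(xbar, ybar, slope):
    """Log dose m at which the line Y = ybar + slope * (X - xbar) equals 5."""
    return xbar + (5.0 - ybar) / slope


m_alpha = log_dose_at_probit_5(xbar_alpha, ybar_alpha, b)
m_beta = log_dose_at_probit_5(xbar_beta, ybar_beta, b)
M = m_beta - m_alpha                      # log of the relative potency

v_b = het / sxx                           # variance of b, inflated by het
g = t ** 2 * v_b / b ** 2
v_means = het / nw_alpha + het / nw_beta  # V(ybar_alpha) + V(ybar_beta)
dx = xbar_beta - xbar_alpha

centre = M + g / (1.0 - g) * (M - dx)
half_width = t / (b * (1.0 - g)) * math.sqrt(
    (1.0 - g) * v_means + (dx - M) ** 2 * v_b
)
lower, upper = centre - half_width, centre + half_width

print(f"M = {M:.4f}, g = {g:.4f}")
print(f"fiducial limits of M: {lower:.4f} to {upper:.4f}")
print(f"relative potency: {10 ** M:.2f} ({10 ** lower:.2f} to {10 ** upper:.2f})")
```

Changing `het` to 1 shows how much of the width of the limits is due to the heterogeneity factor alone.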

9. Other developments of probit analysis. The theory and examples 
given in this chapter are merely an introduction to probit analysis. The 
student who wishes to obtain a full treatment should refer to Probit 
Analysis by D. J. Finney [5].

One feature not covered in this chapter but which may be of great 
importance is the treatment of data containing a check in which mortality 
occurs. This requires a correction throughout the data. The method 
in detail is given by Finney. 

Mortality-dosage experiments can be carried out in factorial design. 
For example an insect spray may be applied in two forms, each at two or
more concentrations. The analysis of the data involves fitting probit
planes.

Another interesting development is the application of graphical methods 
in order to reduce calculations to a minimum. A very complete graphical
method involving nomograms for simplifying most of the calculations has 
been given by Wilcoxon and Litchfield [11]. These methods have con- 
siderable value when an appreciable number of analyses are to be made. 

These and many other phases of the probit analysis technique are 
available to the experimenter who wishes to take advantage of them. 


TABLE 18-5

CALCULATION OF RELATIVE POTENCY OF TWO ANTI-PNEUMOCOCCUS SERA

        (1)         (2)   (3)  (4)  (5)    (6)      (7)   (8)     (9)     (10)   (11)      (12)
Serum   Conc.       X     n    r    p, %   Emp.     Y     w       nw      y      nwX       nwy
                                           probit
α       0.0004375   0.64  40    0    0     −∞       2.7   0.076    3.04   2.32     1.946     7.053
        0.000875    0.94  40    2    5     3.36     3.6   0.302   12.08   3.39    11.355    40.951
        0.00175     1.24  40   14   35     4.61     4.4   0.558   22.32   4.63    27.677   103.342
        0.0035      1.54  40   30   75     5.67     5.3   0.616   24.64   5.65    37.946   139.216
        0.007       1.85  40   34   85     6.04     6.1   0.405   16.20   6.03    29.970    97.686
        Total                                                     78.28          108.894   388.248
β       0.00175     1.24  40    0    0     −∞       2.4   0.040    1.60   2.06     1.984     3.296
        0.0035      1.54  40    2    5     3.36     3.2   0.180    7.20   3.38    11.088    24.336
        0.007       1.85  40   14   35     4.61     4.1   0.471   18.84   4.72    34.854    88.925
        0.014       2.15  40   19   47.5   4.94     5.0   0.637   25.48   4.94    54.782   125.871
        0.028       2.45  40   30   75     5.67     5.8   0.503   20.12   5.67    49.294   114.080
        Total                                                     73.24          152.002   356.508

α:  1/Σnw = 1/78.280 = 0.012775    x̄ = 108.894/78.280 = 1.3911    ȳ = 388.248/78.280 = 4.9597
β:  1/Σnw = 1/73.240 = 0.013654    x̄ = 152.002/73.240 = 2.0754    ȳ = 356.508/73.240 = 4.8677

10. Experimental design. After having worked through the examples
and learned the fundamentals of the probit analysis method, the student 
is in a position to make some decisions as to how new experiments should 
be set up. Two questions normally arise in the minds of investigators 
planning experiments of this type. In the first place they ask: "What
should be the range and number of concentrations?" And in the second
place: "How many insects or animals, etc., should be treated at each
level?"

The question of the range of dose obviously cannot be answered
without some preliminary information, and in dealing with an entirely
new substance it would seem to be desirable to get preliminary data on 
roughly approximate levels for the LD 10, LD 50, and LD 90. It should 
then be possible to set up a series of satisfactory dose levels for a well- 
controlled experiment, keeping in mind that the least information with 
respect to potency is given by the levels that give very low and very high 
percentage kills. The most valuable points are those giving kills from 
25 to 75%. Kills lower than 16% and higher than 84% give so little 
information with respect to the LD 50 that they can practically be dis- 
regarded. Their precision is so low that a much larger number of insects 
is required to estimate them with the accuracy ordinarily obtained in the 
region of the LD 50. 
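
The loss of information at extreme kills can be seen from the weighting coefficient w used in the probit tables. The sketch below assumes the usual definition w = Z²/(PQ), where P is the normal probability corresponding to probit Y, Q = 1 − P, and Z is the normal ordinate at Y − 5; this formula is not restated in this section, and the function names are illustrative only.

```python
import math


def normal_pdf(z):
    """Ordinate of the standard normal curve at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Area of the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_weight(probit):
    """Weighting coefficient w = Z^2 / (P * Q) for an expected probit."""
    z = probit - 5.0
    p = normal_cdf(z)
    return normal_pdf(z) ** 2 / (p * (1.0 - p))


# Weight per insect at a range of expected probits (5.0 is a 50% kill).
for y in (2.7, 3.6, 4.0, 4.4, 5.0, 5.6, 6.0, 6.4, 7.3):
    kill = 100.0 * normal_cdf(y - 5.0)
    print(f"probit {y:.1f}  kill {kill:5.1f}%  w = {probit_weight(y):.3f}")
```

The weight is greatest at a 50% kill and falls away symmetrically on either side, which is one way of seeing why dose levels giving very low or very high kills contribute comparatively little.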

The question of how many insects there should be in each batch is 
related to the number of batches, but in any event it is preferable with a 
given number of insects available to spread these over several batches at 
different levels rather than to have only 3 or 4 larger batches at 3 or 4 
levels. The general accuracy of the experiment is of course increased by 
increasing the number of insects in each batch. Perhaps certain published 
results of probit analysis with 20 to 30 insects in each batch have led 
investigators to think that a greater number is unnecessary. This is of 
course incorrect, and the greatest number should be used that does not
make the experiment too expensive and unwieldy relative to the value of
the information to be obtained. It must be recognized that in experimenting
with animals the numbers may have to be small. The chief
point to remember, however, is that the probit method or any other type 
of analysis does not make up for lack of accuracy in the experiment or 
unsatisfactory techniques. 

In most experiments of the biological assay type it is important to have 
some measure of natural mortality. This is usually accomplished by 
having a control level. When mortality occurs at this level it is advisable 
to make a correction in the analysis. The method for this is described 
by Finney [5] in some detail.

11. Exercises.


1. Table 18-6 gives data from Morrison [10] on results in applying different
concentrations of nicotine sulphate to batches of Drosophila of 150 flies each.

Obtain a provisional regression line for the data and calculate χ² for goodness of fit.

Estimate b and m from the graph.


TABLE 18-6

Concentration    Insects Killed    % Kill
                       r              p
     0.6               97            64.7
     0.8              120            80.0
     1.1              133            88.7
     1.4              137            91.3
     1.8              145            96.7


2. Table 18-7 gives data from Morrison [10] in an experiment for comparing the
effect of containers in the treatment of Drosophila with nicotine sulphate. A batch
consists of 150 flies, but in the α series the flies were treated in 10 containers of 15 flies
each.

Complete the analysis of the data as in Example 18-3, obtaining $M_{\alpha\beta}$ and its fiducial
limits.


TABLE 18-7

        α Series                    β Series
Concentration    % Kill     Concentration    % Kill
     0.6          31.4           0.6          64.8
     0.8          39.7           0.8          79.8
     1.1          56.6           1.1          88.6
     1.4          63.8           1.4          91.6
     1.8          77.7           1.8          96.8


CHAPTER 19


Quality Control and Sampling for Inspection and Verification


1. Definition of quality control. Quality control is considered here 
to be the maintenance of quality in a uniform flow of manufactured 
products. This is essentially what the term quality control has come to 
mean in recent years, but this does not imply that the same principles 
cannot be applied in other fields. Actually, as we shall see later, the 
principles of quality control can be applied wherever we are concerned 
with the establishment of uniform standards of quality in a continuous 
flow of products, or, in other words, where the units or parts of the flow 
are spaced consecutively in time. 

In quality control more emphasis is placed on the time element than in 
many other types of statistical applications. Of course we have time
trends in economic and business statistics, and time is often a factor in 
research experiments, but our interest in such studies is usually to measure, 
interpret, and test the significance of the trends. In quality control the 
objective is to study the trend with sufficient accuracy to observe any 
deviations from an even course, and, when such deviations are observed, 
to take the action necessary to prevent future occurrences. 

2. The control chart. Although there are many phases of quality 
control, one of the basic applications is the use of control charts. To 
illustrate the control chart let us suppose that the product is bread and 
that loaves are produced in batches of 100 at a time. Any variable may 
be the subject of quality control, and there may be as many variables as 
we wish to measure, but for each a separate chart will be required. Let 
us suppose that the variable in this case is loaf weight. We chose this 
variable for the first example because it is easy to measure and does not 
involve destructive sampling. From each batch we form a sub-group or 
sample, the selection of loaves for the sample being purely random. The 
first problem to be solved is the number of loaves required. Since 
sampling is not destructive, the chief determining factors will be time and 
expense, but in any event the minimum number of loaves will be 2 as 
otherwise the control chart will be of no value. A very common practice,
unless there are very good reasons for choosing a different number, is to
take 4 loaves. The weights of the individual loaves are determined and
the mean ($\bar{x}$) and the standard deviation ($s'$) calculated. We now set up
two charts with time as the abscissa, and on one of these we plot means
and on the other the standard deviations. The plotting of one such set
of determinations is of no value, but we proceed to plot similar determinations
as the batches come forward. Obviously, uniformity of product
will be represented by a horizontal row of points and a widely deviating
flow of products by a scatter. After plotting 10 to 20 batches a calculation
can be made of what are known as control limits, the purpose being to
study the relation between chance causes of variation and those due to
definite assignable causes.

Assuming that $k$ samples of $n$ each have been taken, we have the $k$
means $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k$, and the $k$ standard deviations
$s'_1, s'_2, \ldots, s'_k$, where $s'_i$ is calculated from

$$s'_i = \sqrt{\frac{\sum (x - \bar{x}_i)^2}{n}}$$

On the chart for means we plot the mean of all samples given by

$$\bar{X} = \frac{\bar{x}_1 + \bar{x}_2 + \cdots + \bar{x}_k}{k}$$

With respect to control limits for the mean it is customary to set these so
that the probability of a mean's going beyond them due to chance variation
only is quite small. Shewhart [4] of the Bell Telephone Laboratories
takes it as a general rule based on both theoretical and practical grounds
that the limits should be

$$\text{Lower limit } L_1 = \bar{X} - 3s'_{\bar{x}}$$

$$\text{Upper limit } L_2 = \bar{X} + 3s'_{\bar{x}}$$

where $s'_{\bar{x}}$, the standard deviation of a mean of a sample of size $n$, is
calculated as follows.

$$\bar{s}' = \frac{s'_1 + s'_2 + \cdots + s'_k}{k}$$

$$s'_{\bar{x}} = \frac{\bar{s}'}{c_2\sqrt{n}}$$

where

$$c_2 = \sqrt{\frac{2}{n}}\;\frac{\left(\frac{n-2}{2}\right)!}{\left(\frac{n-3}{2}\right)!}$$

For practical calculations Shewhart [4] has tabulated a series of values of
$c_2$. These are reproduced in Table A-5 in the Appendix.

On the chart for standard deviations we plot the mean $\bar{s}'$ and determine
the limits from

$$L_1 = \bar{s}' - 3s'_s$$

$$L_2 = \bar{s}' + 3s'_s$$

where

$$s'_s = \frac{\bar{s}'}{c_2\sqrt{2n}}$$

Whenever $L_1$ turns out to be negative, it is taken as zero.
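
As an illustration, the limits for both charts can be computed in a few lines. The sketch below is not part of the original text: the function and key names are hypothetical, and the factor c2 is evaluated with the gamma function on the assumption that the factorials of half-integers in Shewhart's expression are read as x! = Γ(x + 1).

```python
import math


def c2(n):
    """Shewhart's factor c2 for samples of size n (divisor n in s')."""
    # ((n-2)/2)! = gamma(n/2) and ((n-3)/2)! = gamma((n-1)/2)
    return math.sqrt(2.0 / n) * math.gamma(n / 2.0) / math.gamma((n - 1) / 2.0)


def control_limits(means, sds, n):
    """3-sigma limits for the chart of means and the chart of s'.

    means, sds: the k sample means and the k values of s'.
    n: number of units in each sample.
    """
    k = len(means)
    grand_mean = sum(means) / k
    s_bar = sum(sds) / k
    s_of_mean = s_bar / (c2(n) * math.sqrt(n))
    s_of_sd = s_bar / (c2(n) * math.sqrt(2.0 * n))
    return {
        "mean_lower": grand_mean - 3.0 * s_of_mean,
        "mean_upper": grand_mean + 3.0 * s_of_mean,
        # a negative lower limit for s' is taken as zero
        "sd_lower": max(0.0, s_bar - 3.0 * s_of_sd),
        "sd_upper": s_bar + 3.0 * s_of_sd,
    }


if __name__ == "__main__":
    for size in (2, 4, 10):
        print(size, round(c2(size), 4))
    print(control_limits([10.0, 12.0], [1.0, 1.0], n=4))
```

With small samples the lower limit for the standard deviation is usually negative before truncation, which is why the function clamps it at zero.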
We now examine Figure 19-1 taken from Pearson [2] in a monograph
on industrial standardization and quality control. It represents a control
chart for means taken by a consumer of coal who was receiving it from
two companies. In this instance the means and standard deviations were
calculated from consecutive pairs of loads delivered, so that n = 2. We
note that for fuel A the limits of ash content are much narrower than for
fuel B and only one sample falls outside the inner limits representing a
probability of 0.025 and none outside the outer limits representing a
probability of 0.001.

FIGURE 19-1. Control charts for mean ash content of coal delivered by two
companies (one chart for each fuel; ordinate, percent ash content; abscissa,
sample number). Extracted with permission from Applications of Statistical
Methods to Industrial Standardization and Quality Control, by E. S. Pearson,
B.S. 600, 1935; reproduced by permission of the British Standards Institution,
24/28 Victoria Street, Westminster, London, S.W.1, from whom official copies
can be obtained, price 12/6d, post free.

The charts of Figure 19-1 show very clearly that fuel B was varying
because of definitely assignable causes which could probably have been 
removed by attention to engineering details. 

At this point it is worth while to emphasize again the importance
of the order in which the data are studied. To demonstrate this point
Shewhart [5] gives a chart showing results for 204 observations in groups
of 4 of the resistance in 10² megohms of pieces of insulating material. In
the first instance the data are plotted as the pieces were obtained in production,
and in the second instance the data for the 204 pieces were written
on chips and drawn at random in groups of 4. These charts are reproduced
in Figure 19-2.

It is obvious from Figure 19-2 that there is evidence of lack of control 
when the data are plotted in consecutive order, i.e., of the presence of 
definite assignable causes of variation. When the data are mixed together 
and drawn at random, the limits are closer and there is no evidence 
whatever of lack of control. The latter result is of course to be expected 
as the process of mixing the results and drawing at random has completely 
eliminated the distinction between chance errors and those arising from 
assignable causes. The entire purpose of the control chart is in this 
way defeated. This example brings out the essential importance of 
the time element as portrayed in a control chart. It is analogous 
to bringing another dimension into the representation of the data. 
Without this dimension the important characteristics of the data may 
be lost. 

Violent fluctuations in the means of samples as detected by the control
chart show that the products are fluctuating in average quality and as a
result consumers may be provided with lots that fall below a certain
definite desirable standard. There is another factor in quality, however,
which is frequently of considerable importance. This is the factor of
variation itself, even when the level does not fall below a certain established
standard. Information on variability is given by the control chart
of standard deviations. In general, when the means fluctuate violently
the standard deviations can be expected to fluctuate in a similar manner,
but in certain cases the means appear to be in control whereas the standard
deviations are not. This represents an undesirable condition with respect
to variability of the product and here, again, is presumed to arise from
assignable causes that can be located and removed.

FIGURE 19-2. Control chart from Shewhart [5] for 204 observations in groups
of 4, of the resistance in 10² megohms of pieces of insulating material. Upper
chart represents data as actually taken from the production line. Lower chart
represents data after mixing and drawing out groups of 4 at random. Reproduced
from Statistical Methods from the Viewpoint of Quality Control, W. A. Shewhart,
U.S. Dept. of Agriculture, Graduate School, 1939.

FIGURE 19-3. Control chart for standard deviations. Data from Shewhart [5], as
in Figure 19-2. Reproduced from Statistical Methods from the Viewpoint of Quality
Control, W. A. Shewhart, U.S. Dept. of Agriculture, Graduate School, 1939.

Figure 19-3 from Shewhart [5] gives the control chart for standard 
deviations for the data on 204 pieces of insulating material. Both the 
mean chart and the standard deviation chart showed lack of control. In 
this example the causes of lack of control were found and removed. 

In industrial applications it should be clear that statistical control may 
mean a very great saving in addition to the enhanced reputation that 
comes from the merchandising of a uniformly good product. When 
trouble starts it is of great economic importance to detect it immediately. 
This is one of the most important functions of the control chart. 

3. Summary of control chart technique. The control chart technique 
may be summarized as follows: 

1. Taking small samples from the production line and plotting the 
values of the means, standard deviations, ranges, or other statistics on a 
horizontal chart in which time is the abscissa. 

2. Setting up limits based on the pooled standard deviations of the 
samples on the assumption that this standard deviation is a fair repre- 
sentation of chance variability. 

3. Taking action to ascertain the nature of and to correct for assignable 
causes when means, standard deviations, or other statistics, fall outside 
the limits. 

The theory of the control chart is that differences between units in small 
samples, where the units occur close together in the time sequence, are
much more likely to represent chance deviations than would be the case 
if large samples were taken consisting of units spread widely along the 
production line. Assuming that a good approximation to the standard 
deviation representing chance variation has been made, the limits set up 
are such that, if the statistic charted wanders out of these limits, there 
are good reasons for attributing the variation to assignable causes. 

4. Example 19-1. Calculations for setting up a control chart. The
data are on individual weights of eggs, 10 eggs being taken at random
each day for 24 days.* The data are given in Table 19-1. This example
is useful as an application of the control chart technique in an agricultural
problem. Assuming that a flock of hens has reached the stage in its
growth where the mean egg weight is not showing a distinct trend, the
control chart method can be employed to indicate any decisive changes
in conditions that may affect egg weight. This differs somewhat from a
factory production problem in that seasonal trends will eventually enter
in, but for the usual laying period during the winter months the chart
may be quite effective in giving an indication of the general condition of
the flock. Of course other egg characters could be charted. Weight
was taken in this example because it was an important but easily
determined character.

* Data obtained by courtesy of the University of Manitoba, Poultry Department.

The calculations shown in Table 19-2 would ordinarily be computed
in the order that the data are obtained. Each day the mean and standard
deviation would be plotted on the control chart as in Figure 19-4. A
preliminary check on the state of control might be made after a period of
about 10 days. Here we have assumed that the trial has been operating
for 24 days, and we wish to set up the limits of variation on the control
chart which will then remain for another period of about 20 days, at
which time the data are to be again summarized and new limits set up.

FIGURE 19-4. Control chart for means and standard deviations for
samples of 10 eggs each, taken daily.

Note that in calculating $s'$ we apply the formula for large samples;
i.e., the divisor is the number in the sample. This is in accord with the
recommendations made by Shewhart [4]. In determining the $3s'_{\bar{x}}$ limits
for the mean, the bias in $s'$ is removed by dividing by the factor $c_2$.
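
The calculation of $s'$ with the sample size as divisor can be checked on the first day's sample. This is a minimal sketch; the function name is illustrative and the data are the ten weights listed for Dec. 10 in Table 19-1.

```python
import math

# Egg weights for Dec. 10 from Table 19-1.
dec_10 = [55, 53, 56, 63, 66, 58, 53, 57, 61, 53]


def mean_and_s_prime(sample):
    """Return the mean, the sum of squares about the mean, and s'.

    s' uses the number in the sample as divisor (large-sample formula).
    """
    n = len(sample)
    mean = sum(sample) / n
    sum_sq = sum(x * x for x in sample) - sum(sample) ** 2 / n
    return mean, sum_sq, math.sqrt(sum_sq / n)


mean, sum_sq, s_prime = mean_and_s_prime(dec_10)
s_unbiased_divisor = math.sqrt(sum_sq / (len(dec_10) - 1))

print(f"mean = {mean:.1f}, sum of squares = {sum_sq:.1f}, s' = {s_prime:.2f}")
print(f"with divisor n - 1 the value would be {s_unbiased_divisor:.2f}")
```

The second figure is printed only for comparison; the control limits in this example are built on $s'$ and the correction factor $c_2$.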

On studying the control chart of the means in this example it is clear 
that there is very little evidence of lack of control. The chart for standard 
deviations, however, shows 4 values that are beyond the upper limits. 
An examination of individual weights shows that on these days a single 
exceptionally large egg occurred in the sample. These are undoubtedly 
double-yolked eggs which are known to occur in practically all flocks but
probably with greater frequency under conditions where the hens are
being forced beyond the capacity of their egg-laying apparatus. This 
might be regarded by the poultryman as a type of abnormality on which 
it would be valuable to have a check, and, although it might not be 
possible to eliminate the double-yolked eggs completely, if they seemed to 
occur too frequently action of some sort would be indicated, either by 
varying the diet or by eliminating the offending birds. 
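
The check described above can be reproduced from the daily means and standard deviations of Table 19-2. The sketch below is illustrative only: the function name is hypothetical, and the value c2 = 0.9228 for samples of 10 is taken from the calculations accompanying that table.

```python
import math

days = [f"Dec. {d}" for d in range(10, 32)] + ["Jan. 1", "Jan. 2"]
means = [57.5, 61.0, 61.7, 59.2, 60.2, 58.8, 57.4, 59.0, 64.3, 59.3, 62.1, 58.0,
         56.5, 60.1, 63.3, 60.7, 59.9, 60.0, 60.7, 59.6, 61.8, 65.4, 59.0, 61.2]
sds = [4.30, 7.40, 4.52, 8.08, 4.14, 4.26, 4.52, 5.22, 13.47, 5.10, 10.77, 3.55,
       4.36, 3.18, 12.75, 5.33, 4.57, 4.47, 4.90, 4.10, 4.53, 13.30, 5.02, 5.29]

n = 10          # eggs per daily sample
C2 = 0.9228     # Shewhart's factor for n = 10, as used in Table 19-2

grand_mean = sum(means) / len(means)
s_bar = sum(sds) / len(sds)
s_of_mean = s_bar / (C2 * math.sqrt(n))
s_of_sd = s_bar / (C2 * math.sqrt(2 * n))

mean_lo, mean_hi = grand_mean - 3 * s_of_mean, grand_mean + 3 * s_of_mean
sd_lo, sd_hi = max(0.0, s_bar - 3 * s_of_sd), s_bar + 3 * s_of_sd


def outside(labels, values, low, high):
    """Labels of the points falling outside the limits low..high."""
    return [lab for lab, v in zip(labels, values) if v < low or v > high]


flagged_means = outside(days, means, mean_lo, mean_hi)
flagged_sds = outside(days, sds, sd_lo, sd_hi)

print(f"means: limits {mean_lo:.2f} to {mean_hi:.2f}, outside: {flagged_means}")
print(f"s':    limits {sd_lo:.2f} to {sd_hi:.2f}, outside: {flagged_sds}")
```

The days flagged on the chart for standard deviations are the ones whose samples in Table 19-1 contain a single very heavy egg.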

Although this example seems to indicate that egg weight alone is of
value for control purposes, it should be obvious that other egg characters
could be used. With a fairly large flock it might be quite profitable to
employ destructive sampling in which case quality factors could be
accurately studied. A control chart would be set up for each character,
and a complete check obtained on the output of the flock. Trouble
would be observed at an early stage, and action taken before economic
losses were involved.

There is a specific problem to be considered in applying control charts
to the output of animals or plants in that definite and normal seasonal
trends may take place. Although it might be possible to set up normal
trends and study the deviations from these, it would seem to be more
practical to establish a new mean at frequent intervals. With egg production,
for example, it might be advisable to operate with 10-day periods.
The most efficient method would arise from actual experience and could
be developed while the control chart work is in progress.


TABLE 19-1
DATA ON EGG WEIGHTS: SAMPLES OF 10 EACH TAKEN ON 24 CONSECUTIVE DAYS

Dec. 10    55   53   56   63   66   58   53   57   61   53
     11    59   62   56   51   61   75   57   60   55   74
     12    55   61   62   55   64   59   60   65   66   70
     13    74   44   60   63   56   51   63   55   58   68
     14    55   57   63   67   61   64   65   57   55   58
     15    62   62   54   59   61   56   54   68   57   55
     16    50   53   58   57   58   56   68   57   61   56
     17    61   58   61   50   60   61   54   63   69   53
     18    59   50   55   51   75   57   63   95   78   60
     19    64   62   55   53   60   55   56   71   60   57
     20    58   93   60   60   57   63   60   53   63   54
     21    56   57   58   62   55   57   63   53   55   64
     22    55   51   61   52   61   55   61   51   63   55
     23    60   64   57   58   66   59   57   61   56   63
     24    98   63   56   62   67   62   52   50   57   66
     25    64   66   64   65   55   67   62   59   54   51
     26    52   60   57   65   65   57   65   54   60   64
     27    64   61   69   61   54   55   60   62   54   60
     28    67   61   59   60   61   62   71   56   56   54
     29    65   65   63   63   60   55   54   56   55   60
     30    65   60   66   60   56   70   60   54   64   63
     31    58   64   58   64  103   65   64   67   51   60
Jan.  1    64   63   64   55   57   64   55   63   56   49
      2    66   57   70   59   55   54   62   61   59   69


TABLE 19-2
CALCULATION OF MEANS AND STANDARD DEVIATIONS AND 3s′ LIMITS FOR CONTROL CHART

            Total    Mean     Σ(X²)      Σx²       s′
Dec. 10      575     57.5    33,247     184.5     4.30
     11      610     61.0    37,758     548.0     7.40
     12      617     61.7    38,273     204.1     4.52
     13      592     59.2    35,700     653.6     8.08
     14      602     60.2    36,412     171.6     4.14
     15      588     58.8    34,756     181.6     4.26
     16      574     57.4    33,152     204.4     4.52
     17      590     59.0    35,082     272.0     5.22
     18      643     64.3    43,159    1814.1    13.47
     19      593     59.3    35,425     260.1     5.10
     20      621     62.1    39,725    1160.9    10.77
     21      580     58.0    33,766     126.0     3.55
     22      565     56.5    32,113     190.5     4.36
     23      601     60.1    36,221     100.9     3.18
     24      633     63.3    41,695    1626.1    12.75
     25      607     60.7    37,129     284.1     5.33
     26      599     59.9    36,089     208.9     4.57
     27      600     60.0    36,200     200.0     4.47
     28      607     60.7    37,085     240.1     4.90
     29      596     59.6    35,690     168.4     4.10
     30      618     61.8    38,398     205.6     4.53
     31      654     65.4    44,540    1768.4    13.30
Jan.  1      590     59.0    35,062     252.0     5.02
      2      612     61.2    37,734     279.6     5.29
Total              1446.7                       147.13

$$\bar{X} = \frac{1446.7}{24} = 60.28 \qquad \bar{X} + 3s'_{\bar{x}} = 60.28 + 6.30 = 66.58$$

$$\bar{s}' = \frac{147.13}{24} = 6.130 \qquad \bar{X} - 3s'_{\bar{x}} = 60.28 - 6.30 = 53.98$$

$$s'_{\bar{x}} = \frac{6.130}{0.9228 \times 3.162} = 2.101 \qquad \bar{s}' + 3s'_s = 6.130 + 4.455 = 10.58$$

$$s'_s = \frac{6.130}{0.9228 \times 4.472} = 1.485 \qquad \bar{s}' - 3s'_s = 6.130 - 4.455 = 1.68$$

5. Sampling for inspection. The problems of sampling for purposes of
inspection are much the same regardless of the nature of the material 
inspected. For this reason the inspection of lots of manufactured 
products for the detection of poor quality articles is essentially similar to 
inspecting herds of cattle for disease or fields of grain for the presence of 
off-types or other varieties. It appears to be generally true, however,
that in the manufacturing process the efficiency of inspection procedures
is more readily interpreted in terms of costs and savings. Consequently
the techniques of sampling and inspection for factory outputs have been 
studied extensively and practical procedures have been worked out that 
are now being widely employed. The inspection of agricultural products 
has not been studied as extensively, and in many cases procedures are 
followed that are not the most efficient or the most economical. 

Although the principles of sampling for inspection purposes are 
essentially the same irrespective of the material inspected, in every 
inspection procedure there are certain basic statistical problems to be 
worked out. For example we must find the appropriate theoretical 
distribution of defective units. In considering the probability of obtaining 
a defective unit from a production line in which the proportion of 
defectives is 10%, we are obviously dealing with a binomial distribution. 
In seed sampling, if the proportion of foreign seeds is 1 in 10,000, the 
Poisson distribution is more appropriate. In other examples we may 
have a normal distribution, or a negative binomial, and so forth. Also 
we have to consider the cost of inspection, the importance of accuracy in 
inspection, and the value of the data provided by the inspection as a basis 
for control of quality by the producer. 

It is not possible, therefore, to take procedures developed in a factory 
for inspecting lots for defective parts and apply them directly, for example,
to inspection of fields of seed grain. In all cases there is preliminary
work to be done and data collected before an adequate sampling and 
inspection program can be developed. 

Since the greatest developments in sampling and inspection have taken 
place in the control of quality of manufactured products, the principles 
can best be studied by a brief examination of the procedures developed in 
this field. 

6. Single sampling. This is the term applied to a method of sampling 
worked out by Dodge and Romig [1] of the Bell Telephone System. We 
should visualize in the first place a factory in continuous operation and 
articles of varying quality being produced. For convenience it can be 
assumed that the articles are being collected in lots of 1000, and the job 
of the inspectors is to draw a sample from each lot and decide whether
or not the lot meets the specification that it contain not more than a
given proportion of defectives. An efficient, but not necessarily the most
efficient, method* from the standpoint of the average outgoing quality is
to inspect every item and remove the defectives. This, however, auto- 
matically places inspection costs at a maximum, and if the average quality 
is high a great deal of this is wasted effort. On the other hand such 
great care might be taken in the factory as to render inspection unneces- 
sary, but this requires very specialized procedures and a large proportion 
of highly skilled workers, and the cost of the articles is increased to a 
point where all profits are absorbed. There is obviously a balance 
somewhere between production costs and inspection costs, for a given 
average outgoing quality, that will result in maximum economy. It is 
this balance which the statistician must work out. 

In single sampling the first step is to decide on the sample size and the 
acceptance number. The acceptance number is the number of defective 
units tolerated in a sample of a given size. These are worked out from 
data available on the average quality of the lots submitted, the specified 
tolerance percentage of defective units, and a specified risk of accepting 
a lot having the tolerance percentage of defectives. For example, the lots 
may contain 1000 articles and the specified tolerance limit is 5%. The 
average quality of lots submitted is 1%, and it may be agreed that for 
those lots having exactly 5% defectives the probability of acceptance is 
1/10. The latter is defined as the consumer's risk. Since the sampling 
of the lot is without replacement, the hypergeometric series provides the 
mathematical basis for arriving at the sample size and the acceptance 
number for varying values of the three determining factors. These have 
been worked out by Dodge and Romig [1] in the form of tables and 
charts. In a specific example using the values given above the sample 
size is 130 and the acceptance number is 3. 
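
The consumer's risk for a given plan can be checked directly from the hypergeometric distribution. The sketch below is an illustration rather than the procedure used to build the Dodge and Romig tables; the function name and argument names are hypothetical.

```python
from math import comb


def prob_accept(lot_size, defectives, sample_size, acceptance_number):
    """Probability that a sample drawn without replacement contains no more
    than acceptance_number defectives (hypergeometric distribution)."""
    total = comb(lot_size, sample_size)
    favourable = sum(
        comb(defectives, d) * comb(lot_size - defectives, sample_size - d)
        for d in range(acceptance_number + 1)
    )
    return favourable / total


# Plan quoted in the text: lots of 1000, sample of 130, acceptance number 3.
at_tolerance = prob_accept(1000, 50, 130, 3)   # lot with exactly 5% defectives
at_average = prob_accept(1000, 10, 130, 3)     # lot at the 1% process average

print(f"P(accept | 5% defective) = {at_tolerance:.3f}")
print(f"P(accept | 1% defective) = {at_average:.3f}")
```

The first probability is the consumer's risk for this plan; one minus the second is the chance that a lot of average quality is sent for complete inspection.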

The procedure is then as follows. 

1. Inspect a sample. 

2. If the acceptance number is not exceeded, accept the lot. 

3. If the acceptance number is exceeded, inspect the remainder of the
lot.

It is obvious that if there are very few poor lots it will rarely be 
necessary to make a complete inspection, and inspection costs are kept 
at the minimum. If there are many poor lots, inspection costs will 
increase, but in any event the probability of a consumer's obtaining a poor 
lot is not increased. 


The consumer's risk has already been defined. In addition there is a
producer's risk which arises from the probability that a lot that is better
than the tolerance limit will be rejected and therefore must be inspected
in detail. The relation between the consumer's risk and the producer's
risk is illustrated in Figure 19-5, taken from Dodge and Romig [1]. In
this case the process average is 0.45%, the lot tolerance is 3%, and the
consumer's risk is 10%. Notice that the distribution having a mean of 3%
overlaps the distribution having a mean of 0.45%. This gives a graphical
picture of the two types of risks.

* Heavy routine may lead to errors, and inspectors generally are more alert and
accurate when dealing with samples.

FIGURE 19-5. Illustrating consumer's risk and producer's risk.
From Dodge and Romig [1], with permission.
(The figure shows the probability of finding d defects in samples of 170
drawn from 0.45% defective product, the process average, and from a 3%
defective lot of 1000 pieces, the lot tolerance, with the acceptance line and
a consumer's risk of 0.10 marked.)

7. Double sampling. Experience has shown that a more complex
method than that of single sampling, wherein the results of the sampling
procedure more closely control the extent of the sampling required, may
be more efficient. This has resulted in the method described by Dodge
and Romig [1] as double sampling. The basic principles are the same as
in single sampling in that the same considerations enter into the design.
However, the procedure as illustrated in Figure 19-6 is different.

Note first that two numbers $c_1$ and $c_2$ are set up for the allowable
number of defectives in a sample. If the number in the first sample is
$c_1$ or less, the lot is accepted without question. If the number exceeds
$c_2$, the lot is presumed bad and all articles must be examined. When
the number exceeds $c_1$ but does not exceed $c_2$, a second sample is
examined from which a final decision is made as to acceptance or complete
inspection. Allowing the results of the first sample to influence the extent of
sampling is essentially sound in principle when minimum inspection costs
are an important feature of the sampling process. This principle is
carried to the limit of its application in the next method to be described.

FIGURE 19-6. Procedure of double sampling. Reproduced with permission
from Dodge and Romig [1]. The chart reads as follows.

Inspect a first sample of $n_1$ pieces. If the number of defects found in the
first sample
   does not exceed $c_1$: accept the lot;
   exceeds $c_1$ but does not exceed $c_2$: inspect a second sample of $n_2$ pieces;
   exceeds $c_2$: inspect all the pieces in the remainder of the lot, and
   correct or replace all defective pieces found.
If the number of defects found in the first and second samples combined
   does not exceed $c_2$: accept the lot;
   exceeds $c_2$: inspect all the pieces in the remainder of the lot, and
   correct or replace all defective pieces found.
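
The decision rule of Figure 19-6 can be written as a small function. This is a sketch only; the function name, argument names, and the returned strings are illustrative and do not come from Dodge and Romig.

```python
def double_sampling_decision(first_defects, c1, c2, second_defects=None):
    """Decision for one lot under a double sampling plan.

    first_defects: defects found in the first sample.
    c1, c2: the two acceptance numbers, with c1 < c2.
    second_defects: defects found in the second sample, or None if the
        second sample has not been inspected yet.
    """
    if first_defects <= c1:
        return "accept"
    if first_defects > c2:
        return "inspect remainder of lot"
    if second_defects is None:
        return "take second sample"
    # The combined count of both samples is compared with c2.
    if first_defects + second_defects <= c2:
        return "accept"
    return "inspect remainder of lot"


if __name__ == "__main__":
    # Hypothetical plan with c1 = 1 and c2 = 4.
    print(double_sampling_decision(0, 1, 4))
    print(double_sampling_decision(3, 1, 4))
    print(double_sampling_decision(3, 1, 4, second_defects=1))
    print(double_sampling_decision(3, 1, 4, second_defects=2))
    print(double_sampling_decision(5, 1, 4))
```

A second sample is called for only in the doubtful middle range, which is where the saving in inspection over single sampling arises.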

8. Sequential sampling. In this method the principle of allowing the 
results obtained to determine the extent of the sampling is carried to its 
logical conclusion, i.e., to the point of allowing the results obtained to 
determine exactly the extent of the sampling required to reach a decision. 
The method has been developed largely by the late Professor Abraham 
Wald of Columbia University, and the theory is summarized in his book, 
Sequential Analysis [6]. In the introduction to this book Prof. Wald states: 

Sequential analysis is a method of statistical inference whose 
characteristic feature is that the number of observations required by the 
procedure is not determined in advance of the experiment. The decision 
to terminate the experiment depends, at each stage, on the results of the 
observations previously made. The merit of the sequential method, as 
applied to testing statistical hypotheses, is that test procedures can be 
constructed which require, on the average, a substantially smaller 
number of observations than equally reliable test procedures based on 
a predetermined number of observations. 


In the development of the sequential theory as applied to sampling it 
is assumed that X is a random variable that can take only the values 
0 or 1. For example, we can suppose that X = 1 when an item examined 
is defective and X = 0 when the item is satisfactory. 

Let p = the probability that X takes the value 1. Then the problem 
is one of testing the hypothesis that p does not exceed some specified 
value p'. In sampling, the value of p' is selected such that we wish to 
reject a lot when p > p' and accept the lot when p < p'. For example, 
p' in sampling is the tolerance limit, say 1/10 = 0.1. If p > 0.1, we 
should reject the lot. If p < 0.1, we should accept. We then select 
2 levels $p_1$ and $p_0$, where $p_0$ represents a low proportion of defectives and 
$p_1$ a high proportion, and the values of $p_0$ and $p_1$ are such that it is a serious 
matter to reject a lot in which the proportion is $p_0$ and a serious matter 
to accept a lot for which the proportion is $p_1$. The problem is one first 
of selecting values for $p_0$ and $p_1$ and second of assigning probabilities, say 
$\alpha$ of accepting a lot for which the proportion defective is $p_1$, and say $\beta$ 
of rejecting a lot for which the proportion defective is $p_0$. Assigning 
values to $p_0$, $p_1$, $\alpha$, and $\beta$ is not a statistical problem. It is one of 
engineering, establishing manufacturing costs, and retaining customer good 
will. Once these values have been assigned, Wald [6] gives formulas for 
calculating the acceptance number $n_a$ and the rejection number $n_r$ when 
a given sequential procedure of sampling is followed. These are 

$$
n_a = \frac{\log\dfrac{\beta}{1-\alpha} - m\log\dfrac{1-p_1}{1-p_0}}{\log\dfrac{p_1}{p_0} - \log\dfrac{1-p_1}{1-p_0}}, 
\qquad 
n_r = \frac{\log\dfrac{1-\beta}{\alpha} - m\log\dfrac{1-p_1}{1-p_0}}{\log\dfrac{p_1}{p_0} - \log\dfrac{1-p_1}{1-p_0}}
$$

where m is the number in the sample. These rather heavy-looking 
formulas are actually quite easy for calculation and can be simplified by 
putting 

$$
A = \frac{1-\beta}{\alpha}, \qquad B = \frac{\beta}{1-\alpha}, \qquad r_1 = \frac{p_1}{p_0}, \qquad r_2 = \frac{1-p_1}{1-p_0} 
$$

Then 

$$
n_a = \frac{\log B - m\log r_2}{\log r_1 - \log r_2}, 
\qquad 
n_r = \frac{\log A - m\log r_2}{\log r_1 - \log r_2} 
$$

In order to apply these formulas it is necessary to make up a table for 
each value of m showing the corresponding values of $n_a$ and $n_r$ required to 
accept or reject. It is more convenient, however, to follow a graphical 
method wherein the cumulative results of examining each unit are marked 
on a graph and acceptance or rejection decided in accordance with the 
graph passing over the lower or upper of two parallel lines. To construct 
the graph we require 

$$
h_0 = \frac{\log B}{\log r_1 - \log r_2}, \qquad 
h_1 = \frac{\log A}{\log r_1 - \log r_2}, \qquad 
S = \frac{-\log r_2}{\log r_1 - \log r_2}, \qquad 
E(n) = \frac{\log A \,\log B}{\log r_1 \,\log r_2} 
$$

where $h_0$ and $h_1$ govern the position of the parallel lines, S the slope, and 
E(n) gives a measure of the maximum number in the sample that will be 
required in order to reach a decision. 


We shall construct the graph for the values $p_0 = 0.1$, $p_1 = 0.3$, 
$\alpha = 0.02$, and $\beta = 0.03$. This graph is shown in Figure 19-7. 
First we find 

$$
\begin{aligned}
B &= 0.03061, & \log B &= -1.5141, \\
A &= 48.5, & \log A &= 1.6857, \\
r_1 &= 3, & \log r_1 &= 0.4771, \\
r_2 &= 0.7778, & \log r_2 &= -0.1091.
\end{aligned}
$$

Then 

$$
\begin{aligned}
h_0 &= \frac{-1.5141}{0.4771 + 0.1091} = \frac{-1.5141}{0.5862} = -2.58, \\
h_1 &= \frac{1.6857}{0.5862} = 2.88, \\
S &= \frac{0.1091}{0.5862} = 0.186, \\
E(n) &= \frac{1.5141 \times 1.6857}{0.4771 \times 0.1091} = \frac{2.5523}{0.0521} = 49.0.
\end{aligned}
$$
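
The arithmetic of this example can be checked with a few lines of code. This is a minimal sketch using common logarithms; the function names are illustrative, and A and B are formed from the stated values of 0.02 and 0.03 in the way that reproduces the figures B = 0.03061 and A = 48.5 quoted in the text.

```python
import math

def sequential_constants(p0, p1, alpha, beta):
    """Return (h0, h1, S, E_n) for the binomial sequential chart."""
    A = (1 - beta) / alpha
    B = beta / (1 - alpha)
    log_r1 = math.log10(p1 / p0)
    log_r2 = math.log10((1 - p1) / (1 - p0))   # negative, since p1 > p0
    d = log_r1 - log_r2
    h0 = math.log10(B) / d                     # intercept of acceptance line
    h1 = math.log10(A) / d                     # intercept of rejection line
    s = -log_r2 / d                            # common slope of both lines
    e_n = math.log10(A) * math.log10(B) / (log_r1 * log_r2)
    return h0, h1, s, e_n

def decide(defectives, m, h0, h1, s):
    """Compare the cumulative number of defectives after m items
    with the acceptance number h0 + S*m and rejection number h1 + S*m."""
    if defectives <= h0 + s * m:
        return "accept"
    if defectives >= h1 + s * m:
        return "reject"
    return "continue"

h0, h1, s, e_n = sequential_constants(0.1, 0.3, 0.02, 0.03)
print(round(h0, 2), round(h1, 2), round(s, 3), round(e_n, 1))
```

With these constants a lot cannot be accepted before the 14th item, even if no defectives at all have been found, because the acceptance line is still below zero at m = 13.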


It should be pointed out that the formulas given in (1), (2), (3), and (4) 
are based on the binomial distribution. In other words, it is assumed 
that, with p = fraction defective and (1 - p) = q, the fraction not 
defective, the distribution of defectives in samples of size n is given by 
the expansion of $(q + p)^n$. When p is quite small the Poisson distribution 
is more suitable than the binomial, but different formulas for $h_0$, $h_1$, and 
E(n) are required. These are given below.* 

where $m_0$ and $m_1$ are the means of the Poisson between which we want to 
distinguish and ln represents the natural logarithm. 

These formulas can best be understood by an application to a practical 
example. Suppose that seed samples are being drawn and the problem 
is to reject lots having a weed seed content of 3 or more per pound and 
to accept lots having a content of 1 or less per pound. Then $m_0 = 1$ 
and $m_1 = 3$. The probabilities of acceptance and rejection have to be 
determined arbitrarily as this is a commercial rather than a statistical 
problem. For illustration we shall take $\alpha = 0.1$ and $\beta = 0.1$. 

* Worked out and supplied by the courtesy of G. B. Oakland, Science Service 
Branch, Department of Agriculture, Ottawa. 

The graph is shown in Figure 19-8. Before calculating the values for 
the graph it is useful to determine $\bar{n}_0$ and $\bar{n}_1$, the number of drawings 
required on the average to reach a decision, first when the lot contains 
1 weed seed or less per pound and second when the lot contains 3 weed 
seeds or more per pound. These are given by 

It would appear that 2 one-pound samples will be sufficient on the average 
to obtain a decision. If a decision is not reached after taking 3 x 1.95 
samples = 6 approximately, the rule is to stop the sampling procedure 
and take the mean of all samples drawn. If this is closest to $m_0$ the lot 
is accepted, and if closest to $m_1$ the lot is rejected. 

FIGURE 19-8. Sequential sampling diagram for distinguishing 
between lots having $m_0 = 1$ weed seed per pound and lots 
having $m_1 = 3$ weed seeds per pound. $\alpha = 0.1$, $\beta = 0.1$. 
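
The weed seed example can be worked numerically. The sketch below assumes the standard Wald forms of the sequential test for two Poisson means, with natural logarithms; this is an assumption, and the arrangement may differ from the formulas referred to in the text. The function and variable names are illustrative. Under these assumptions the average number of drawings for a lot at $m_0$ comes out at about 1.95, the figure used in the stopping rule above.

```python
import math

def poisson_sequential(m0, m1, alpha, beta):
    """Chart constants and average sample numbers for a sequential
    test between Poisson means m0 < m1 (assumed Wald forms)."""
    ln_a = math.log((1 - beta) / alpha)
    ln_b = math.log(beta / (1 - alpha))
    ln_ratio = math.log(m1 / m0)
    h0 = ln_b / ln_ratio               # intercept of acceptance line
    h1 = ln_a / ln_ratio               # intercept of rejection line
    s = (m1 - m0) / ln_ratio           # slope, in seeds per sample
    # Approximate average number of samples when the true mean is m0 or m1
    n0 = ((1 - alpha) * ln_b + alpha * ln_a) / (m0 * ln_ratio - (m1 - m0))
    n1 = (beta * ln_b + (1 - beta) * ln_a) / (m1 * ln_ratio - (m1 - m0))
    return h0, h1, s, n0, n1

h0, h1, s, n0, n1 = poisson_sequential(1, 3, 0.1, 0.1)
truncation = round(3 * n0)   # stop after about 3 x 1.95 = 6 samples
print(round(h0, 2), round(h1, 2), round(s, 2), round(n0, 2), round(n1, 2))
```

Sampling would continue while the cumulative weed seed count after m one-pound samples lies between h0 + s*m and h1 + s*m.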

9. Verification trials. Fundamentally the verification trial as applied 
to registered plant stocks consists in taking a random sample of seed 
from a given lot, placing this sample in a growing test, and making counts 
to obtain an estimate of the proportion of off-types in the lot. This type 
of test differs from seed sampling or field inspection in that the number 
of plants grown must be determined in advance, and only 1 sample can 
be taken. The statistical problem is one of deciding on the size of 
sample required based on the tolerance limit for off-types and the accuracy 
with which estimates should be made. 

Tolerance limits differ according to the crop. A few of the tolerance 
limits for off-types of different crops are given below, based on standards 
for foundation stock in the Canadian Seed Growers' Association. 


Cereals              1 in 20,000 
Cucumber             1 in 1,000 
Beans (vegetable)    1 in 500 
Carrots              1 in 10 


We note that there is a very wide range, but for cereals and most of 
the vegetable crops the tolerance limits are such that, in samples taken 
from lots in which the proportion of off-types is in the range of the 
tolerance limit, the distribution will be of the Poisson type. 

Defining a sample unit as the number in which the tolerance limit is 
1 off-type, the basic Poisson distribution for verification trials of 1 
sampling unit is given in Figure 19-9. It is used in this manner. Suppose 
that the tolerance limit of off-types is 1:1000. If we take a large number 
of samples of 1000 each from a lot in which the proportion of off-types 
is exactly 1: 1000, the chart shows that in 36.8% of these samples we will 
expect to get 0 off-types, in 36.8% 1 off-type, in 18.4% 2 off-types, and so 
forth. Notice that in 73.6% of the trials we will get samples coming 
within the tolerance limit. This means that, if 1000 farmers submitted 
samples for verification trials all from lots containing exactly the tolerance 
limit, it would be expected on the basis of random sampling and taking 
1 sampling unit from each lot that 74% of the lots would be accepted and 
26% rejected. 

Suppose now that in the verification trials 2 sampling units are used. 
For example, in verifying cereals we would grow 40,000 plants, and for 
cucumbers, 2000 plants. The distribution of off-types would be as in 
Figure 19-10. 

FIGURE 19-9. Basic Poisson distribution for verification trials 
of 1 sampling unit. 

FIGURE 19-10. Basic Poisson distribution for verification trials 
of 2 sampling units. 

We note now that 2 off-types is the tolerance limit because we are using 
2 sampling units. Totaling the first 3 columns, we have 67.7% as the 
percentage of lots that would be accepted and 32.3% that would be 
rejected, provided that all lots sampled contained exactly the tolerance 
limit. For other numbers of sampling units the data can be summarized 
as in Table 19-3. This table reflects merely the greater accuracy attained 
with larger numbers of plants. Since the increase in accuracy is very 
slow after reaching 5 sampling units, it would seem that there is not a 
great deal to be gained in going beyond this point. 
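
The percentages quoted for 1 and 2 sampling units follow directly from the Poisson distribution, and the same calculation extends to any number of units. The sketch below is illustrative only (the function names are not from the text): with k sampling units the expected number of off-types in a lot exactly at the tolerance limit is k, and the lot is accepted when k or fewer are found.

```python
import math

def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson variable with the given mean."""
    term = math.exp(-mean)      # P(X = 0)
    total = term
    for x in range(1, k + 1):
        term *= mean / x        # P(X = x) from P(X = x - 1)
        total += term
    return total

def percent_accepted(units):
    """Percentage of lots at the tolerance limit that are accepted
    when `units` sampling units are grown."""
    return 100 * poisson_cdf(units, float(units))

for k in range(1, 11):
    acc = percent_accepted(k)
    print(k, round(acc, 1), round(100 - acc, 1))
```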

A more valuable point of view in deciding on the number of sampling 
units required is to determine limits $p_0$ and $p_1$ for the proportion of 
off-types in lots from which samples are taken such that if the lot contains 
the proportion $p_0$ it will be rejected in only a certain small percentage of 
the trials and if the proportion is $p_1$ the lot will be accepted in only a 
certain small percentage of the trials. 


TABLE 19-3 

PERCENTAGE OF SAMPLES ACCEPTED OR REJECTED ACCORDING TO 
NUMBER OF SAMPLING UNITS 

Number of 
Sampling Units    % Accepted    % Rejected 
       1             73.6          26.4 
       2             67.7          32.3 
       3             64.7          35.3 
       4             62.9          37.1 
       5             61.6          38.4 
       6             60.6          39.4 
       7             59.9          40.1 
       8             59.2          40.8 
       9             58.7          41.3 
      10             58.3          41.7 

This can be done by examining the range of Poisson distributions for 
different proportions of off-types. For example, suppose that we set the 
probability of acceptance of a poor lot at P = 0.10; what we want to 
know is just how poor this lot can be in order to have 1 chance in 10 of 
being accepted, with different numbers of sampling units. With 1 
sampling unit the lot is accepted when 1 or 0 off-types are found. We 
then determine the Poisson distribution with a value of m such that the 
0 and 1 columns of the histogram contain 10% of the total area. The 
value of m for this distribution corresponds to $p_1$, the proportion in the 
lot such that it has only 1 chance in 10 of being accepted. With 2 
sampling units the lot is accepted when 2, 1, or 0 off-types are found. 
The corresponding Poisson distribution required is that in which the 0, 1, 
and 2 columns add to 10% of the total area. In a similar manner $p_1$ 
values can be found for any number of sampling units. 

The values of $p_0$ are found in a corresponding manner. With 1 
sampling unit a lot is accepted when either 0 or 1 off-type is found, and 
in this case a distribution must be selected such that 90% of the area falls 
in the columns 0 and 1. 
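
The search for the Poisson distribution having a stated area in its first columns can be carried out numerically. The following is a sketch by bisection, not the method used to prepare the book's charts; the function names are illustrative. The result is expressed as off-types per sampling unit, and for 3 sampling units at the 0.05 level it gives values close to the 2.58 and 0.45 quoted in the text.

```python
import math

def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson variable with the given mean."""
    term = math.exp(-mean)
    total = term
    for x in range(1, k + 1):
        term *= mean / x
        total += term
    return total

def lot_value(units, prob_accept):
    """Off-types per sampling unit in a lot whose chance of acceptance
    is prob_accept when `units` sampling units are grown and the lot is
    accepted on finding `units` or fewer off-types."""
    lo, hi = 0.0, 50.0 * units          # bracket for the mean count in the whole sample
    for _ in range(100):
        mid = (lo + hi) / 2
        if poisson_cdf(units, mid) > prob_accept:
            lo = mid                    # acceptance still too likely: poorer lot needed
        else:
            hi = mid
    return (lo + hi) / 2 / units

print(round(lot_value(3, 0.05), 2))     # lot poorer than tolerance, accepted 1 time in 20
print(round(lot_value(3, 0.95), 2))     # lot better than tolerance, rejected 1 time in 20
```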

Figure 19-11 results from an examination of the Poisson distributions 
for 3 probability levels, P = 0.01, 0.05, and 0.10. The A curves show 
for different numbers of sampling units the lot values ($p_1$) such that the 
chances of acceptance are 0.01 on the $A_{0.01}$ curve, 0.05 on the $A_{0.05}$ curve, 
and 0.10 on the $A_{0.10}$ curve. For example, if 2 sampling units are taken 
for the verification of a lot containing an average of 3.1 off-types per 
sampling unit, the probability of acceptance as read on the $A_{0.05}$ curve is 
0.05, and conversely the probability of rejection is 0.95. The A curves 
therefore are thought of in terms of the probability of acceptance of lots 
that are poorer than the tolerance limit. The R curves show the 
probability of rejection ($p_0$) of lots that are better than the tolerance limit. For 
example, a lot containing an average of 0.04 off-types per sampling unit 
will have only 1 chance in 20 of being rejected. 


[Figure 19-11. Vertical axis: off-types per sampling unit in supply. 
Horizontal axis: number of sampling units.] 

FIGURE 19-11. Limits of accuracy of verification trials with different 
numbers of sampling units. 

The curves show also the limits of accuracy within which the verification 
trials with different numbers of sampling units are determined. For 
example, with 3 sampling units it can be said with a reasonable degree 
of certainty that the stocks accepted do not contain more than 2.58 
off-types per sampling unit and those that are rejected do not contain less 
than 0.45 off-types per sampling unit. In addition, Figure 19-11 
summarizes the information required for selecting sample sizes in verification 
trials. For most verification work the $A_{0.05}$ and $R_{0.05}$ curves are probably 
of the greatest importance. These curves enable us to arrive at a very 
simple conclusion with respect to all such trials. There is a very rapid 
improvement in accuracy as the sampling units are increased from 1 to 
2, but beyond 2 the improvement is slow. It would seem that a general 
conclusion can be drawn to the effect that 2 sampling units are the 
minimum for verification trials and that beyond 4 sampling units the increase 
in accuracy is not sufficient to warrant the additional labor involved. A 
safe working rule would be to take 3 units for an average testing procedure, 
to reduce this to 2 units when the work involved is heavy, and to increase 
it to 4 units whenever conditions are such as to make this possible. 


10. Exercises. 


1. Make a control chart from the first 5 columns of the data of Table 19-1, 
calculating the required values from the results for the first 10 days. Plot the complete 
data for the means and standard deviations on this chart. 

2. Set up a sequential sampling chart for $p_0 = 0.05$, $p_1 = 0.2$, $\alpha = 0.05$, $\beta = 0.10$. 

3. By coloring seeds to represent impurities, make up a 10-pound sample containing 
approximately 10 impurities per ounce. Letting $m_0 = 1$, $m_1 = 5$, $\alpha = 0.05$, $\beta = 0.1$, 
calculate and set up the sequential sampling graph, and then inspect the 10-pound 
sample by drawing 1-ounce samples in succession until a decision is reached. 

TABLE A-1 

AREAS, ½(1 + α), OF THE NORMAL CURVE IN TERMS OF THE NORMAL DEVIATE 

Normal deviate | .00 | .01 | .02 | .03 | .04 | .05 | .07 | .08 | .09 
0.0 | .5000 | .5040 | .5080 | .5120 | .5160 | .5199 | .5279 | .5319 | .5359 
0.1 | .5398 | .5438 | .5478 | .5517 | .5557 | .5596 | .5675 | .5714 | .5753 
0.2 | .5793 | .5832 | .5871 | .5910 | .5948 | .5987 | .6064 | .6103 | .6141 
0.3 | .6179 | .6217 | .6255 | .6293 | .6331 | .6368 | .6443 | .6480 | .6517 
0.4 | .6554 | .6591 | .6628 | .6664 | .6700 | .6736 | .6808 | .6844 | .6879 
0.5 | .6915 | .6950 | .6985 | .7019 | .7054 | .7088 | .7157 | .7190 | .7224 
0.6 | .7257 | .7291 | .7324 | .7357 | .7389 | .7422 | .7486 | .7517 | .7549 
0.7 | .7580 | .7611 | .7642 | .7673 | .7704 | .7734 | .7794 | .7823 | .7852 
0.8 | .7881 | .7910 | .7939 | .7967 | .7995 | .8023 | .8078 | .8106 | .8133 
0.9 | .8159 | .8186 | .8212 | .8238 | .8264 | .8289 | .8340 | .8365 | .8389 
1.0 | .8413 | .8438 | .8461 | .8485 | | .8531 | .8577 | .8599 | .8621 
1.1 | .8643 | .8665 | .8686 | .8708 | | .8749 | .8790 | .8810 | .8830 
1.2 | .8849 | .8869 | .8888 | .8907 | | .8944 | .8980 | .8997 | .9015 
1.3 | .9032 | .9049 | .9066 | .9082 | | .9115 | .9147 | .9162 | .9177 
1.4 | .9192 | .9207 | .9222 | .9236 | | .9265 | .9292 | .9306 | .9319 
1.5 | .9332 | .9345 | .9357 | .9370 | | .9394 | .9418 | .9429 | .9441 
1.6 | .9452 | .9463 | .9474 | .9484 | | .9505 | .9525 | .9535 | .9545 
1.7 | .9554 | .9564 | .9573 | .9582 | | .9599 | .9616 | .9625 | .9633 
1.8 | .9641 | .9649 | .9656 | .9664 | | .9678 | .9693 | .9699 | .9706 
1.9 | .9713 | .9719 | .9726 | .9732 | | .9744 | .9756 | .9761 | .9767 
2.0 | .9772 | .9778 | .9783 | .9788 | | .9798 | .9808 | .9812 | .9817 
2.1 | .9821 | .9826 | .9830 | .9834 | | .9842 | .9850 | .9854 | .9857 
2.2 | .9861 | .9864 | .9868 | .9871 | | .9878 | .9884 | .9887 | .9890 
2.3 | .9893 | .9896 | .9898 | .9901 | | .9906 | .9911 | .9913 | .9916 
2.4 | .9918 | .9920 | .9922 | .9925 | | .9929 | .9932 | .9934 | .9936 
2.5 | .9938 | .9940 | .9941 | .9943 | | .9946 | .9949 | .9951 | .9952 
2.6 | .9953 | .9955 | .9956 | .9957 | | .9960 | .9962 | .9963 | .9964 
2.7 | .9965 | .9966 | .9967 | .9968 | | .9970 | .9972 | .9973 | .9974 
2.8 | .9974 | .9975 | .9976 | .9977 | | .9978 | .9979 | .9980 | .9981 
2.9 | .9981 | .9982 | .9982 | .9983 | | .9984 | .9985 | .9986 | .9986 
3.0 | .9987 | .9987 | .9987 | .9988 | | .9989 | .9989 | .9990 | .9990 
3.1 | .9990 | .9991 | .9991 | .9991 | | .9992 | .9992 | .9993 | .9993 
3.2 | .9993 | .9993 | .9994 | .9994 | | .9994 | .9995 | .9995 | .9995 
3.3 | .9995 | .9995 | .9995 | .9996 | | .9996 | .9996 | .9996 | .9997 
3.4 | .9997 | .9997 | .9997 | .9997 | | .9997 | .9997 | .9997 | .9998 
3.5 | .9998 | .9998 | .9998 | .9998 | | .9998 | .9998 | .9998 | .9998 
3.6 | .9998 | .9998 | .9999 | .9999 | | .9999 | .9999 | .9999 | .9999 
3.7 | .9999 | .9999 | .9999 | .9999 | | .9999 | .9999 | .9999 | .9999 
3.8 | .9999 | .9999 | .9999 | .9999 | | .9999 | .9999 | .9999 | 1.0000 

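TABLE A-2 

ORDINATES OF THE NORMAL CURVE IN TERMS OF THE NORMAL DEVIATE 

Normal deviate | .01 | .02 | .03 | .04 | .05 | .06 | .07 | .08 
0.0 | .3989 | .3989 | .3988 | .3986 | .3984 | .3982 | .3980 | .3977 
0.1 | .3965 | .3961 | .3956 | .3951 | .3945 | .3939 | .3932 | .3925 
0.2 | .3902 | .3894 | .3885 | .3876 | .3867 | .3857 | .3847 | .3836 
0.3 | .3802 | .3790 | .3778 | .3765 | .3752 | .3739 | .3725 | .3712 
0.4 | .3668 | .3653 | .3637 | .3621 | .3605 | .3589 | .3572 | .3555 
0.5 | .3503 | .3485 | .3467 | .3448 | .3429 | .3410 | .3391 | .3372 
0.6 | .3312 | .3292 | .3271 | .3251 | .3230 | .3209 | .3187 | .3166 
0.7 | .3101 | .3079 | .3056 | .3034 | .3011 | .2989 | .2966 | .2943 
0.8 | .2874 | .2850 | .2827 | .2803 | .2780 | .2756 | .2732 | .2709 
0.9 | .2637 | .2613 | .2589 | .2565 | .2541 | .2516 | .2492 | .2468 
1.0 | .2396 | .2371 | .2347 | .2323 | .2299 | .2275 | .2251 | .2227 
1.1 | .2155 | .2131 | .2107 | .2083 | .2059 | .2036 | .2012 | .1989 
1.2 | .1919 | .1895 | .1872 | .1849 | .1826 | .1804 | .1781 | .1758 
1.3 | .1691 | .1669 | .1647 | .1626 | .1604 | .1582 | .1561 | .1539 
1.4 | .1476 | .1456 | .1435 | .1415 | .1394 | .1374 | .1354 | .1334 
1.5 | | .1257 | .1238 | .1219 | .1200 | .1182 | .1163 | .1145 
1.6 | | .1074 | .1057 | .1040 | .1023 | .1006 | .0989 | .0973 
1.7 | | .0909 | .0893 | .0878 | .0863 | .0848 | .0833 | .0818 
1.8 | | .0761 | .0748 | .0734 | .0721 | .0707 | .0694 | .0681 
1.9 | | .0632 | .0620 | .0608 | .0596 | .0584 | .0573 | .0562 
2.0 | | .0519 | .0508 | .0498 | .0488 | .0478 | .0468 | .0459 
2.1 | | .0422 | .0413 | .0404 | .0396 | .0387 | .0379 | .0371 
2.2 | | .0339 | .0332 | .0325 | .0317 | .0310 | .0303 | .0297 
2.3 | | .0270 | | .0258 | .0252 | .0246 | .0241 | .0235 
2.4 | | .0213 | | .0203 | .0198 | .0194 | .0189 | .0184 
2.5 | | .0167 | | .0158 | .0154 | .0151 | .0147 | .0143 
2.6 | | .0129 | | .0122 | .0119 | .0116 | .0113 | .0110 
2.7 | | .0099 | | .0093 | .0091 | .0088 | .0086 | .0084 
2.8 | | .0075 | | .0071 | | .0067 | .0065 | .0063 
2.9 | | .0056 | | .0053 | | .0050 | .0048 | .0047 
3.0 | | .0042 | | .0039 | | .0037 | .0036 | .0035 
3.1 | | .0031 | | .0029 | | .0027 | .0026 | .0025 
3.2 | | | | .0021 | | .0020 | .0019 | .0018 
3.3 | | | | .0015 | | .0014 | .0014 | .0013 
3.4 | | | | .0011 | | .0010 | .0010 | .0009 
3.5 | | | | | | | | .0007 
3.6 | | | | | | | | .0005 
3.7 | | | | | | | | .0003 
3.8 | | | | | | | | .0002 
3.9 | | | | | | | | .0001 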


TABLE A-3 

TABLE OF t* 

                     Probability 
  DF    0.50    0.10    0.05    0.02    0.01    0.001 
   1   1.000    6.31   12.71   31.82   63.66    637 
   2    .816    2.92    4.30    6.96    9.92    31.6 
   3    .765    2.35    3.18    4.54    5.84    12.9 
   4    .741    2.13    2.78    3.75    4.60    8.61 
   5    .727    2.02    2.57    3.36    4.03    6.86 
   6    .718    1.94    2.45    3.14    3.71    5.96 
   7    .711    1.90    2.36    3.00    3.50    5.40 
   8    .706    1.86    2.31    2.90    3.36    5.04 
   9    .703    1.83    2.26    2.82    3.25    4.78 
  10    .700    1.81    2.23    2.76    3.17    4.59 

  11    .697    1.80    2.20    2.72    3.11    4.44 
  12    .695    1.78    2.18    2.68    3.06    4.32 
  13    .694    1.77    2.16    2.65    3.01    4.22 
  14    .692    1.76    2.14    2.62    2.98    4.14 
  15    .691    1.75    2.13    2.60    2.95    4.07 
  16    .690    1.75    2.12    2.58    2.92    4.02 
  17    .689    1.74    2.11    2.57    2.90    3.96 
  18    .688    1.73    2.10    2.55    2.88    3.92 
  19    .688    1.73    2.09    2.54    2.86    3.88 
  20    .687    1.72    2.09    2.53    2.84    3.85 
  21    .686    1.72    2.08    2.52    2.83    3.82 
  22    .686    1.72    2.07    2.51    2.82    3.79 
  23    .685    1.71    2.07    2.50    2.81    3.77 
  24    .685    1.71    2.06    2.49    2.80    3.74 
  25    .684    1.71    2.06    2.48    2.79    3.72 
  26    .684    1.71    2.06    2.48    2.78    3.71 
  27    .684    1.70    2.05    2.47    2.77    3.69 
  28    .683    1.70    2.05    2.47    2.76    3.67 
  29    .683    1.70    2.04    2.46    2.76    3.66 
  30    .683    1.70    2.04    2.46    2.75    3.65 
  35    .682    1.69    2.03    2.44    2.72    3.59 
  40    .681    1.68    2.02    2.42    2.71    3.55 
  45    .680    1.68    2.02    2.41    2.69    3.52 
  50    .679    1.68    2.01    2.40    2.68    3.50 
  60    .678    1.…     2.00    2.39    2.66    3.46 
  70    .678    1.…     2.00    2.38    2.65    3.44 
  80    .677    1.…     1.99    2.38    2.64    3.42 
  90    .677    1.…     1.99    2.37    2.63    3.40 
 100    .677    1.…     1.98    2.36    2.63    3.39 
 120    .676    1.…     1.98    2.36    2.62    3.37 
 150    .676    1.…     1.98    2.35    2.61    3.36 
 200    .675    1.…     1.97    2.35    2.60    3.34 
 300    .675    1.…     1.97    2.34    2.59    3.32 
 400    .675    1.…     1.97    2.34    2.59    3.32 
 500    .674    1.…     1.96    2.33    2.59    3.31 
1000    .674    1.…     1.96    2.33    2.58    3.30 
   ∞    .674    1.…     1.96    2.33    2.58    3.29 


* The data of this table extracted from Table IV of R. A. Fisher's Statistical Methods 
for Research Workers with the permission of the author and his publishers, Oliver and 
Boyd, Edinburgh. 

TABLE A-4 

TABLE OF χ²* 

                                   Probability 
DF    0.99    0.95    0.50    0.30    0.20    0.10    0.05    0.02    0.01   0.001 
 1  0.0002   0.004    0.46    1.07    1.64    2.71    3.84    5.41    6.64   10.83 
 2   0.020   0.103    1.39    2.41    3.22    4.60    5.99    7.82    9.21   13.82 
 3   0.115    0.35    2.37    3.66    4.64    6.25    7.82    9.84   11.34   16.27 
 4    0.30    0.71    3.36    4.88    5.99    7.78    9.49   11.67   13.28   18.46 
 5    0.55    1.14    4.35    6.06    7.29    9.24   11.07   13.39   15.09   20.52 
 6    0.87    1.64    5.35    7.23    8.56   10.64   12.59   15.03   16.81   22.46 
 7    1.24    2.17    6.35    8.38    9.80   12.02   14.07   16.62   18.48   24.32 
 8    1.65    2.73    7.34    9.52   11.03   13.36   15.51   18.17   20.09   26.12 
 9    2.09    3.32    8.34   10.66   12.24   14.68   16.92   19.68   21.67   27.88 
10    2.56    3.94    9.34   11.78   13.44   15.99   18.31   21.16   23.21   29.59 

11    3.05    4.58   10.34   12.90   14.63   17.28   19.68   22.62   24.72   31.26 
12    3.57    5.23   11.34   14.01   15.81   18.55   21.03   24.05   26.22   32.91 
13    4.11    5.89   12.34   15.12   16.98   19.81   22.36   25.47   27.69   34.53 
14    4.66    6.57   13.34   16.22   18.15   21.06   23.68   26.87   29.14   36.12 
15    5.23    7.26   14.34   17.32   19.31   22.31   25.00   28.26   30.58   37.70 

16    5.81    7.96   15.34   18.42   20.46   23.54   26.30   29.63   32.00   39.25 
17    6.41    8.67   16.34   19.51   21.62   24.77   27.59   31.00   33.41   40.79 
18    7.02    9.39   17.34   20.60   22.76   25.99   28.87   32.35   34.80   42.31 
19    7.63   10.12   18.34   21.69   23.90   27.20   30.14   33.69   36.19   43.82 
20    8.26   10.85   19.34   22.78   25.04   28.41   31.41   35.02   37.57   45.32 

21    8.90   11.59   20.34   23.86   26.17   29.62   32.67   36.34   38.93   46.80 
22    9.54   12.34   21.34   24.94   27.30   30.81   33.92   37.66   40.29   48.27 
23   10.20   13.09   22.34   26.02   28.43   32.01   35.17   38.97   41.64   49.73 
24   10.86   13.85   23.34   27.10   29.55   33.20   36.42   40.27   42.98   51.18 
25   11.52   14.61   24.34   28.17   30.68   34.38   37.65   41.57   44.31   52.62 

26   12.20   15.38   25.34   29.25   31.80   35.56   38.88   42.86   45.64   54.05 
27   12.88   16.15   26.34   30.32   32.91   36.74   40.11   44.14   46.96   55.48 
28   13.56   16.93   27.34   31.39   34.03   37.92   41.34   45.42   48.28   56.89 
29   14.26   17.71   28.34   32.46   35.14   39.09   42.56   46.69   49.59   58.30 
30   14.95   18.49   29.34   33.53   36.25   40.26   43.77   47.96   50.89   59.70 


* The data of this table extracted from Table III of R. A. Fisher's Statistical Methods 
for Research Workers with the permission of the author and his publishers, Oliver and 
Boyd, Edinburgh. 


TABLE A-5 

CORRECTION FACTOR $c_2$ FOR CALCULATION OF AVERAGE 
STANDARD DEVIATIONS 

  n      c2           n      c2 
  3   0.72360        22   0.96545 
  4   0.79788        23   0.96697 
  5   0.84069        24   0.96837 
  6   0.86863        25   0.96965 
  7   0.88820        30   0.97475 
  8   0.90270        35   0.97839 
  9   0.91388        40   0.98111 
 10   0.92275        45   0.98322 
 11   0.92996        50   0.98491 
 12   0.93594        55   0.98629 
 13   0.94098        60   0.98744 
 14   0.94529        65   0.98841 
 15   0.94901        70   0.98924 
 16   0.95225        75   0.98996 
 17   0.95511        80   0.99059 
 18   0.95765        85   0.99115 
 19   0.95991        90   0.99164 
 20   0.96194        95   0.99208 
 21   0.96378       100   0.99248 

TABLE A-6 

5% AND 1% POINTS FOR THE DISTRIBUTION OF F* 

$n_1$ Degrees of Freedom for Greater Mean Square 

n2 | Point | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 | 16 | 20 | 24 | 30 | 40 | 50 | 75 | 100 | 200 | 500 | ∞ 
1 | 5% | 161 | 200 | 216 | | | | | | | 242 | 243 | 244 | 245 | 246 | 248 | 249 | 250 | 251 | 252 | 253 | 253 | 254 | 254 | 254 
1 | 1% | 4052 | 4999 | 5403 | | | | | | | 6056 | 6082 | 6106 | 6142 | 6169 | 6208 | 6234 | 6258 | 6286 | 6302 | 6323 | 6334 | 6352 | 6361 | 6366 
2 | 5% | 18.51 | 19.00 | 19.16 | | | | | | | 19.39 | 19.40 | 19.41 | 19.42 | 19.43 | 19.44 | | 19.46 | 19.47 | 19.47 | 19.48 | 19.49 | 19.49 | 19.50 | 19.50 
2 | 1% | 98.49 | 99.00 | 99.17 | | | | | | | 99.40 | 99.41 | 99.42 | 99.43 | 99.44 | 99.45 | | 99.47 | 99.48 | 99.48 | 99.49 | 99.49 | 99.49 | 99.50 | 99.50 
3 | 5% | 10.13 | 9.55 | 9.28 | | | | | | | 8.78 | | | | 8.69 | 8.66 | | 8.62 | 8.60 | 8.58 | 8.57 | 8.56 | 8.54 | 8.54 | 8.53 
3 | 1% | 34.12 | 30.82 | 29.46 | | | | | | | 27.23 | | | | 26.83 | 26.69 | | 26.50 | 26.41 | 26.35 | 26.27 | 26.23 | 26.18 | 26.14 | 26.12 
4 | 5% | 7.71 | 6.94 | 6.59 | | | | | | | 5.96 | | | | 5.84 | 5.80 | | 5.74 | 5.71 | 5.70 | 5.68 | 5.66 | 5.65 | 5.64 | 5.63 
4 | 1% | 21.20 | 18.00 | 16.69 | | | | | | | 14.54 | | | | 14.15 | 14.02 | | 13.83 | 13.74 | 13.69 | 13.61 | 13.57 | 13.52 | 13.48 | 13.46 
5 | 5% | 6.61 | 5.79 | 5.41 | | | | | | | 4.74 | 4.70 | 4.68 | 4.64 | 4.60 | 4.56 | | 4.50 | 4.46 | 4.44 | 4.42 | 4.40 | 4.38 | 4.37 | 4.36 
5 | 1% | 16.26 | 13.27 | 12.06 | | | | | | | 10.05 | 9.96 | 9.89 | 9.77 | 9.68 | 9.55 | | 9.38 | 9.29 | 9.24 | 9.17 | 9.13 | 9.07 | 9.04 | 9.02 
6 | 5% | 5.99 | 5.14 | 4.76 | | | | | | | 4.06 | 4.03 | 4.00 | 3.96 | 3.92 | 3.87 | | 3.81 | 3.77 | 3.75 | 3.72 | 3.71 | 3.69 | 3.68 | 3.67 
6 | 1% | 13.74 | 10.92 | 9.78 | | | | | | | 7.87 | 7.79 | 7.72 | 7.60 | 7.52 | 7.39 | | 7.23 | 7.14 | 7.09 | 7.02 | 6.99 | 6.94 | 6.90 | 6.88 
7 | 5% | 5.59 | 4.74 | 4.35 | 4.12 | 3.97 | 3.87 | 3.79 | 3.73 | 3.68 | 3.63 | 3.60 | 3.57 | 3.52 | 3.49 | 3.44 | | 3.38 | 3.34 | 3.32 | 3.29 | 3.28 | 3.25 | 3.24 | 3.23 
7 | 1% | 12.25 | 9.55 | 8.45 | 7.85 | 7.46 | 7.19 | 7.00 | 6.84 | 6.71 | 6.62 | 6.54 | 6.47 | 6.35 | 6.27 | 6.15 | | 5.98 | 5.90 | 5.85 | 5.78 | 5.75 | 5.70 | 5.67 | 5.65 
8 | 5% | 5.32 | 4.46 | 4.07 | 3.84 | 3.69 | 3.58 | 3.50 | 3.44 | 3.39 | 3.34 | 3.31 | 3.28 | 3.23 | 3.20 | 3.15 | | 3.08 | 3.05 | 3.03 | 3.00 | 2.98 | 2.96 | 2.94 | 2.93 
8 | 1% | 11.26 | 8.65 | 7.59 | 7.01 | 6.63 | 6.37 | 6.19 | 6.03 | 5.91 | 5.82 | 5.74 | 5.67 | 5.56 | 5.48 | 5.36 | | 5.20 | 5.11 | 5.06 | 5.00 | 4.96 | 4.91 | 4.88 | 4.86 
9 | 5% | 5.12 | 4.26 | 3.86 | 3.63 | 3.48 | 3.37 | 3.29 | 3.23 | 3.18 | 3.13 | 3.10 | 3.07 | 3.02 | 2.98 | 2.93 | | 2.86 | 2.82 | 2.80 | 2.77 | 2.76 | 2.73 | 2.72 | 2.71 
9 | 1% | 10.56 | 8.02 | 6.99 | 6.42 | 6.06 | 5.80 | 5.62 | 5.47 | 5.35 | 5.26 | 5.18 | 5.11 | 5.00 | 4.92 | 4.80 | | 4.64 | 4.56 | 4.51 | 4.45 | 4.41 | 4.36 | 4.33 | 4.31 
10 | 5% | 4.96 | 4.10 | 3.71 | 3.48 | 3.33 | 3.22 | 3.14 | 3.07 | 3.02 | 2.97 | 2.94 | 2.91 | 2.86 | 2.82 | 2.77 | | 2.70 | 2.67 | 2.64 | 2.61 | 2.59 | 2.56 | 2.55 | 2.54 
10 | 1% | 10.04 | 7.56 | 6.55 | 5.99 | 5.64 | 5.39 | 5.21 | 5.06 | 4.95 | 4.85 | 4.78 | 4.71 | 4.60 | 4.52 | 4.41 | | 4.25 | 4.17 | 4.12 | 4.05 | 4.01 | 3.96 | 3.93 | 3.91 
11 | 5% | 4.84 | 3.98 | 3.59 | 3.36 | 3.20 | 3.09 | 3.01 | 2.95 | 2.90 | 2.86 | 2.82 | 2.79 | 2.74 | 2.70 | 2.65 | | 2.57 | 2.53 | 2.50 | 2.47 | 2.45 | 2.42 | 2.41 | 2.40 
11 | 1% | 9.65 | 7.20 | 6.22 | 5.67 | 5.32 | 5.07 | 4.88 | 4.74 | 4.63 | 4.54 | 4.46 | 4.40 | 4.29 | 4.21 | 4.10 | | 3.94 | 3.86 | 3.80 | 3.74 | 3.70 | 3.66 | 3.62 | 3.60 
12 | 5% | 4.75 | 3.88 | 3.49 | 3.26 | 3.11 | 3.00 | 2.92 | 2.85 | 2.80 | 2.76 | 2.72 | 2.69 | 2.64 | 2.60 | 2.54 | | 2.46 | 2.42 | 2.40 | 2.36 | 2.35 | 2.32 | 2.31 | 2.30 
12 | 1% | 9.33 | 6.93 | 5.95 | 5.41 | 5.06 | 4.82 | 4.65 | 4.50 | 4.39 | 4.30 | 4.22 | 4.16 | 4.05 | 3.98 | 3.86 | | 3.70 | 3.61 | 3.56 | 3.49 | 3.46 | 3.41 | 3.38 | 3.36 
13 | 5% | 4.67 | 3.80 | 3.41 | 3.18 | 3.02 | 2.92 | 2.84 | 2.77 | 2.72 | 2.67 | 2.63 | 2.60 | 2.55 | 2.51 | 2.46 | | 2.38 | 2.34 | 2.32 | 2.28 | 2.26 | 2.24 | 2.22 | 2.21 
13 | 1% | 9.07 | 6.70 | 5.74 | 5.20 | 4.86 | 4.62 | 4.44 | 4.30 | 4.19 | 4.10 | 4.02 | 3.96 | 3.85 | 3.78 | 3.67 | | 3.51 | 3.42 | 3.37 | 3.30 | 3.27 | 3.21 | 3.18 | 3.16 


* This table taken from G. W. Snedecor's Statistical Methods, Iowa State College Press, 4th ed., 1946, reproduced by permission of the author and his publishers. Calculated by 
G. W. Snedecor from Table VI of R. A. Fisher's Statistical Methods for Research Workers. 

TABLE A-6 (continued) 

$n_1$ Degrees of Freedom for Greater Mean Square 

n2 | Point | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 | 16 | 20 | 24 | 30 | 40 | 50 | 75 
50 | 5% | 4.03 | 3.18 | 2.79 | 2.56 | 2.40 | 2.29 | 2.20 | 2.13 | 2.07 | 2.02 | 1.98 | 1.95 | 1.90 | 1.85 | 1.78 | 1.74 | 1.69 | 1.63 | 1.60 | 1.55 
50 | 1% | 7.17 | 5.06 | 4.20 | 3.72 | 3.41 | 3.18 | 3.02 | 2.88 | 2.78 | 2.70 | 2.62 | 2.56 | 2.46 | 2.39 | 2.26 | 2.18 | 2.10 | 2.00 | 1.94 | 1.86 
55 | 5% | 4.02 | 3.17 | 2.78 | 2.54 | 2.38 | 2.27 | 2.18 | 2.11 | 2.05 | 2.00 | 1.97 | 1.93 | 1.88 | 1.83 | 1.76 | 1.72 | 1.67 | 1.61 | 1.58 | 1.52 
55 | 1% | 7.12 | 5.01 | 4.16 | 3.68 | 3.37 | 3.15 | 2.98 | 2.85 | 2.75 | 2.66 | 2.59 | 2.53 | 2.43 | 2.35 | 2.23 | 2.15 | 2.06 | 1.96 | 1.90 | 1.82 
60 | 5% | 4.00 | 3.15 | 2.76 | 2.52 | 2.37 | 2.25 | 2.17 | 2.10 | 2.04 | 1.99 | 1.95 | 1.92 | 1.86 | 1.81 | 1.75 | 1.70 | 1.65 | 1.59 | 1.56 | 1.50 
60 | 1% | 7.08 | 4.98 | 4.13 | 3.65 | 3.34 | 3.12 | 2.95 | 2.82 | 2.71 | 2.63 | 2.56 | 2.50 | 2.40 | 2.32 | 2.20 | 2.12 | 2.03 | 1.93 | 1.87 | 1.79 
65 | 5% | 3.99 | 3.14 | 2.75 | 2.51 | 2.36 | 2.24 | 2.15 | 2.08 | 2.02 | 1.98 | 1.94 | 1.90 | 1.85 | 1.80 | 1.73 | 1.68 | 1.63 | 1.57 | 1.54 | 1.49 
65 | 1% | 7.04 | 4.95 | 4.10 | 3.62 | 3.31 | 3.09 | 2.93 | 2.79 | 2.70 | 2.61 | 2.54 | 2.47 | 2.37 | 2.30 | 2.18 | 2.09 | 2.00 | 1.90 | 1.84 | 1.76 
70 | 5% | 3.98 | 3.13 | 2.74 | 2.50 | 2.35 | 2.23 | 2.14 | 2.07 | 2.01 | 1.97 | 1.93 | 1.89 | 1.84 | 1.79 | 1.72 | 1.67 | 1.62 | 1.56 | 1.53 | 1.47 
70 | 1% | 7.01 | 4.92 | 4.08 | 3.60 | 3.29 | 3.07 | 2.91 | 2.77 | 2.67 | 2.59 | 2.51 | 2.45 | 2.35 | 2.28 | 2.15 | 2.07 | 1.98 | 1.88 | 1.82 | 1.74 
80 | 5% | 3.96 | 3.11 | 2.72 | 2.48 | 2.33 | 2.21 | 2.12 | 2.05 | 1.99 | 1.95 | 1.91 | 1.88 | 1.82 | 1.77 | 1.70 | 1.65 | 1.60 | 1.54 | 1.51 | 1.45 
80 | 1% | 6.96 | 4.88 | 4.04 | 3.56 | 3.25 | 3.04 | 2.87 | 2.74 | 2.64 | 2.55 | 2.48 | 2.41 | 2.32 | 2.24 | 2.11 | 2.03 | 1.94 | 1.84 | 1.78 | 1.70 
100 | 5% | 3.94 | 3.09 | 2.70 | 2.46 | 2.30 | 2.19 | 2.10 | 2.03 | 1.97 | 1.92 | 1.88 | 1.85 | 1.79 | 1.75 | 1.68 | 1.63 | 1.57 | 1.51 | 1.48 | 1.42 
100 | 1% | 6.90 | 4.82 | 3.98 | 3.51 | 3.20 | 2.99 | 2.82 | 2.69 | 2.59 | 2.51 | 2.43 | 2.36 | 2.26 | 2.19 | 2.06 | 1.98 | 1.89 | 1.79 | 1.73 | 1.64 
125 | 5% | 3.92 | 3.07 | 2.68 | 2.44 | 2.29 | 2.17 | 2.08 | 2.01 | 1.95 | 1.90 | 1.86 | 1.83 | 1.77 | 1.72 | 1.65 | 1.60 | 1.55 | 1.49 | 1.45 | 1.39 
125 | 1% | 6.84 | 4.78 | 3.94 | 3.47 | 3.17 | 2.95 | 2.79 | 2.65 | 2.56 | 2.47 | 2.40 | 2.33 | 2.23 | 2.15 | 2.03 | 1.94 | 1.85 | 1.75 | 1.68 | 1.59 
150 | 5% | 3.91 | 3.06 | 2.67 | 2.43 | 2.27 | 2.16 | 2.07 | 2.00 | 1.94 | 1.89 | 1.85 | 1.82 | 1.76 | 1.71 | 1.64 | 1.59 | 1.54 | 1.47 | 1.44 | 1.37 
150 | 1% | 6.81 | 4.75 | 3.91 | 3.44 | 3.14 | 2.92 | 2.76 | 2.62 | 2.53 | 2.44 | 2.37 | 2.30 | 2.20 | 2.12 | 2.00 | 1.91 | 1.83 | 1.72 | 1.66 | 1.56 
200 | 5% | 3.89 | 3.04 | 2.65 | 2.41 | 2.26 | 2.14 | 2.05 | 1.98 | 1.92 | 1.87 | 1.83 | 1.80 | 1.74 | 1.69 | 1.62 | 1.57 | 1.52 | 1.45 | 1.42 | 1.35 
200 | 1% | 6.76 | 4.71 | 3.88 | 3.41 | 3.11 | 2.90 | 2.73 | 2.60 | 2.50 | 2.41 | 2.34 | 2.28 | 2.17 | 2.09 | 1.97 | 1.88 | 1.79 | 1.69 | 1.62 | 1.53 
400 | 5% | 3.86 | 3.02 | 2.62 | 2.39 | 2.23 | 2.12 | 2.03 | 1.96 | 1.90 | 1.85 | 1.81 | 1.78 | 1.72 | 1.67 | 1.60 | 1.54 | 1.49 | 1.42 | 1.38 | 1.32 
400 | 1% | 6.70 | 4.66 | 3.83 | 3.36 | 3.06 | 2.85 | 2.69 | 2.55 | 2.46 | 2.37 | 2.29 | 2.23 | 2.12 | 2.04 | 1.92 | 1.84 | 1.74 | 1.64 | 1.57 | 1.47 
1000 | 5% | 3.85 | 3.00 | 2.61 | 2.38 | 2.22 | 2.10 | 2.02 | 1.95 | 1.89 | 1.84 | 1.80 | 1.76 | 1.70 | 1.65 | 1.58 | 1.53 | 1.47 | 1.41 | 1.36 | 1.30 
1000 | 1% | 6.66 | 4.62 | 3.80 | 3.34 | 3.04 | 2.82 | 2.66 | 2.53 | 2.43 | 2.34 | 2.26 | 2.20 | 2.09 | 2.01 | 1.89 | 1.81 | 1.71 | 1.61 | 1.54 | 1.44 
∞ | 5% | 3.84 | 2.99 | 2.60 | 2.37 | 2.21 | 2.09 | 2.01 | 1.94 | 1.88 | 1.83 | 1.79 | 1.75 | 1.69 | 1.64 | 1.57 | 1.52 | 1.46 | 1.40 | 1.35 | 1.28 
∞ | 1% | 6.64 | 4.60 | 3.78 | 3.32 | 3.02 | 2.80 | 2.64 | 2.51 | 2.41 | 2.32 | 2.24 | 2.18 | 2.07 | 1.99 | 1.87 | 1.79 | 1.69 | 1.59 | 1.52 | 1.41 
TABLE A-7


RANDOM NUMBERS*


87 02 22
39 28 97 69
87 15 85 99 22 16 88 47 72 53 26 13 70 38 11 70 52 97 46 70 53 70


* Reprinted from Table XXXIII of Fisher and Yates' Statistical Tables, with the
permission of the authors and publishers, Oliver and Boyd, Edinburgh.
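SOME MATRIX CALCULATION METHODS


1. Evaluation of determinants.
a. Unsymmetrical.


Let the determinant be as shown on the left. The main calculations
involve setting up the auxiliary matrix as on the right.


$$
\text{Given Matrix}\quad
\begin{vmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{vmatrix}
\qquad
\text{Auxiliary Matrix}\quad
\begin{matrix}
a_{11} & X_{12} & X_{13} & X_{14} \\
a_{21} & X_{22} & X_{23} & X_{24} \\
a_{31} & X_{32} & X_{33} & X_{34} \\
a_{41} & X_{42} & X_{43} & X_{44} \\
(1) & (2) & (3) & (4)
\end{matrix}
$$


(1) Write down the first column from the given matrix and

$$
X_{12} = a_{12}/a_{11}, \qquad X_{13} = a_{13}/a_{11}, \qquad X_{14} = a_{14}/a_{11}
$$

(2)
$$
\begin{aligned}
X_{22} &= a_{22} - a_{21}X_{12} \\
X_{32} &= a_{32} - a_{31}X_{12} \\
X_{42} &= a_{42} - a_{41}X_{12}
\end{aligned}
$$

(3)
$$
\begin{aligned}
X_{23} &= (a_{23} - a_{21}X_{13})/X_{22} \\
X_{33} &= a_{33} - a_{31}X_{13} - X_{32}X_{23} \\
X_{43} &= a_{43} - a_{41}X_{13} - X_{42}X_{23}
\end{aligned}
$$

(4)
$$
\begin{aligned}
X_{24} &= (a_{24} - a_{21}X_{14})/X_{22} \\
X_{34} &= (a_{34} - a_{31}X_{14} - X_{32}X_{24})/X_{33} \\
X_{44} &= a_{44} - a_{41}X_{14} - X_{42}X_{24} - X_{43}X_{34}
\end{aligned}
$$

$$
D = a_{11} \times X_{22} \times X_{33} \times X_{44}
$$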


EXAMPLE: VALUE OF A DETERMINANT


Given Matrix


 27.948     -6.2992    -5.2125      8.1643
-84.522    174.46      19.412      -9.8742
 -6.3428     4.5542     3.9844     12.748
  7.1932   -10.764     20.398     115.35


Auxiliary Matrix


 27.948     -0.22539   -0.18651     0.29212
-84.522    155.41       0.023472    0.095337
 -6.3428     3.1246     2.7281      5.2428
  7.1932    -9.1427    21.954      -0.98007


D = 27.948 × 155.41 × 2.7281 × (-0.98007) = -11613
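The four steps extend to a determinant of any order. The following is a minimal Python sketch of that column-by-column scheme, assuming no diagonal element of the auxiliary matrix is zero (no row interchange is attempted). The names `auxiliary_matrix` and `determinant` are illustrative and do not come from the text. Because the sketch carries full floating-point precision instead of five significant figures, its result for the example agrees with the hand-computed value only to about three figures.

```python
def auxiliary_matrix(a):
    """Return the auxiliary matrix X for a square matrix a (list of rows).

    On and below the diagonal: X[i][j] = a[i][j] - sum over k < j of X[i][k] * X[k][j].
    Above the diagonal: the same difference, summed over k < i, divided by X[i][i].
    Assumption: every diagonal X[i][i] is non-zero (no pivoting).
    """
    n = len(a)
    x = [[0.0] * n for _ in range(n)]
    for j in range(n):              # fill one column of X at a time
        for i in range(n):
            m = min(i, j)
            s = a[i][j] - sum(x[i][k] * x[k][j] for k in range(m))
            x[i][j] = s if i >= j else s / x[i][i]
    return x


def determinant(a):
    """Product of the diagonal elements of the auxiliary matrix."""
    d = 1.0
    for i, row in enumerate(auxiliary_matrix(a)):
        d *= row[i]
    return d


# Given matrix of the worked example.
given = [
    [27.948, -6.2992, -5.2125, 8.1643],
    [-84.522, 174.46, 19.412, -9.8742],
    [-6.3428, 4.5542, 3.9844, 12.748],
    [7.1932, -10.764, 20.398, 115.35],
]

aux = auxiliary_matrix(given)
for row in aux:
    print(["%.5g" % v for v in row])
print("D =", determinant(given))
```

The last diagonal element is a small difference of large terms, so it is the entry most affected by rounding in the hand calculation.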
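b. Symmetrical.
Let the determinant and auxiliary matrix be


$$
\text{Given Matrix}\quad
\begin{vmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{12} & a_{22} & a_{23} & a_{24} \\
a_{13} & a_{23} & a_{33} & a_{34} \\
a_{14} & a_{24} & a_{34} & a_{44}
\end{vmatrix}
\qquad
\text{Auxiliary Matrix}\quad
\begin{matrix}
a_{11} & X_{12} & X_{13} & X_{14} \\
a_{12} & X_{22} & X_{23} & X_{24} \\
a_{13} & X_{32} & X_{33} & X_{34} \\
a_{14} & X_{42} & X_{43} & X_{44} \\
(1) & (2) & (3) & (4)
\end{matrix}
$$

The steps are:


(1) Same as for the unsymmetrical matrix.

(2)
$$
\begin{aligned}
X_{22} &= a_{22} - a_{12}X_{12} \\
X_{32} &= a_{23} - a_{13}X_{12}, \qquad X_{23} = X_{32}/X_{22} \\
X_{42} &= a_{24} - a_{14}X_{12}, \qquad X_{24} = X_{42}/X_{22}
\end{aligned}
$$

(3)
$$
\begin{aligned}
X_{33} &= a_{33} - a_{13}X_{13} - X_{32}X_{23} \\
X_{43} &= a_{34} - a_{14}X_{13} - X_{42}X_{23}, \qquad X_{34} = X_{43}/X_{33}
\end{aligned}
$$

(4)
$$
X_{44} = a_{44} - a_{14}X_{14} - X_{42}X_{24} - X_{43}X_{34}
$$

$$
D = a_{11} \times X_{22} \times X_{33} \times X_{44}
$$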


EXAMPLE: EVALUATION OF A DETERMINANT


Given Matrix

 1.2643     0.98724   -0.40312     1.5734
 0.98724    1.4396     2.0937     -0.96820
-0.40312    2.0937     0.87326     2.0431
 1.5734    -0.96820    2.0431      2.6114


Auxiliary Matrix

 1.2643     0.78086   -0.31885     1.2445
 0.98724    0.66870    3.6017     -3.2855
-0.40312    2.4085    -7.9300     -1.3188
 1.5734    -2.1970    10.458       7.2271

D = -48.453
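For a symmetrical matrix only the elements on and below the diagonal need the subtraction step; each element above the diagonal is the mirror element divided by the diagonal element of its row. A minimal Python sketch of this shortcut follows, again assuming non-zero diagonal elements. The function name `symmetric_auxiliary` is an illustrative choice, not the author's.

```python
def symmetric_auxiliary(a):
    """Auxiliary matrix for a symmetric matrix a (list of rows).

    Column j on and below the diagonal is obtained by subtraction;
    row j above the diagonal is then X[j][i] = X[i][j] / X[j][j].
    Assumption: a is symmetric and no X[j][j] is zero.
    """
    n = len(a)
    x = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j, n):
            x[i][j] = a[i][j] - sum(x[i][k] * x[k][j] for k in range(j))
        for i in range(j + 1, n):
            x[j][i] = x[i][j] / x[j][j]
    return x


# Given matrix of the worked example (symmetric).
sym = [
    [1.2643, 0.98724, -0.40312, 1.5734],
    [0.98724, 1.4396, 2.0937, -0.96820],
    [-0.40312, 2.0937, 0.87326, 2.0431],
    [1.5734, -0.96820, 2.0431, 2.6114],
]

aux_sym = symmetric_auxiliary(sym)
d_sym = 1.0
for i in range(len(sym)):
    d_sym *= aux_sym[i][i]      # D is the product of the diagonal elements
print("D =", d_sym)
```

The printed value should agree with the hand-computed D to about three significant figures; the remaining difference comes from rounding in the five-figure working.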


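2. Simultaneous equations.

Suppose that the equations are

$$
\begin{aligned}
a_{11}X_1 + a_{12}X_2 + a_{13}X_3 &= a_{14} \\
a_{12}X_1 + a_{22}X_2 + a_{23}X_3 &= a_{24} \\
a_{13}X_1 + a_{23}X_2 + a_{33}X_3 &= a_{34}
\end{aligned}
$$

We set up the given matrix and calculate the auxiliary matrix.


$$
\text{Given Matrix}\quad
\begin{matrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{12} & a_{22} & a_{23} & a_{24} \\
a_{13} & a_{23} & a_{33} & a_{34}
\end{matrix}
\qquad
\text{Auxiliary Matrix}\quad
\begin{matrix}
a_{11} & X_{12} & X_{13} & X_{14} \\
a_{12} & X_{22} & X_{23} & X_{24} \\
a_{13} & X_{32} & X_{33} & X_{34} \\
(1) & (2) & (3) & (4)
\end{matrix}
$$


(1) Same as in evaluation of determinants.

(2)
$$
\begin{aligned}
X_{22} &= a_{22} - a_{12}X_{12} \\
X_{32} &= a_{23} - a_{13}X_{12}, \qquad X_{23} = X_{32}/X_{22} \\
X_{24} &= (a_{24} - a_{12}X_{14})/X_{22}
\end{aligned}
$$

(3)
$$
X_{33} = a_{33} - a_{13}X_{13} - X_{32}X_{23}
$$

(4)
$$
X_{34} = (a_{34} - a_{13}X_{14} - X_{32}X_{24})/X_{33}
$$

Then

$$
\begin{aligned}
X_3 &= X_{34} \\
X_2 &= X_{24} - X_{23}X_3 \\
X_1 &= X_{14} - X_{12}X_2 - X_{13}X_3
\end{aligned}
$$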


EXAMPLE: SOLUTION OF SIMULTANEOUS EQUATIONS


Given Matrix


947.17     -20.834     272.76     694.96
-20.834    414.23     -354.00      79.648
272.76    -354.00      630.00     407.00


Auxiliary Matrix


947.17      -0.021996    0.28797    0.73372
-20.834    413.77       -0.84105    0.22944
272.76    -348.00      258.77       1.1080


X3 = 1.1080
X2 = 0.22944 - (-0.84105 × 1.1080) = 1.1613
X1 = 0.73372 - (-0.021996 × 1.1613) - (0.28797 × 1.1080) = 0.44019
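The same reduction, followed by the back-substitution shown in the example, can be sketched in Python as follows. This is an illustrative sketch only: `solve_by_auxiliary` is a hypothetical name, the input is the given matrix with the right-hand sides as its last column, and non-zero diagonal elements are assumed. It uses the general (unsymmetrical) rule for every element, so it does not rely on the symmetry shortcut X23 = X32/X22.

```python
def solve_by_auxiliary(aug):
    """Solve n linear equations from an n x (n + 1) given matrix.

    The auxiliary matrix is built column by column; the last column
    (right-hand sides) is reduced by the same rule as the elements
    above the diagonal. Back-substitution then yields the unknowns.
    Assumption: no diagonal element of the auxiliary matrix is zero.
    """
    n = len(aug)
    x = [[0.0] * (n + 1) for _ in range(n)]
    for j in range(n + 1):
        for i in range(n):
            m = min(i, j)
            s = aug[i][j] - sum(x[i][k] * x[k][j] for k in range(m))
            x[i][j] = s if i >= j else s / x[i][i]
    sol = [0.0] * n
    for i in range(n - 1, -1, -1):   # last unknown first
        sol[i] = x[i][n] - sum(x[i][k] * sol[k] for k in range(i + 1, n))
    return sol


# Given matrix of the worked example.
equations = [
    [947.17, -20.834, 272.76, 694.96],
    [-20.834, 414.23, -354.00, 79.648],
    [272.76, -354.00, 630.00, 407.00],
]

solution = solve_by_auxiliary(equations)
print(["%.5g" % v for v in solution])
```

Substituting the printed values back into the three equations is a quick check that the reduction was carried out correctly.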
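3. Calculation of Gauss multipliers and partial regression and correlation
coefficients from a correlation matrix.


Original Matrix

$$
\begin{array}{cccccc|cccccc}
a_{11} & a_{12} & a_{13} & a_{14} & a_{15} & a_{16} & 1 & 0 & 0 & 0 & 0 & 0 \\
a_{12} & a_{22} & a_{23} & a_{24} & a_{25} & a_{26} & 0 & 1 & 0 & 0 & 0 & 0 \\
a_{13} & a_{23} & a_{33} & a_{34} & a_{35} & a_{36} & 0 & 0 & 1 & 0 & 0 & 0 \\
a_{14} & a_{24} & a_{34} & a_{44} & a_{45} & a_{46} & 0 & 0 & 0 & 1 & 0 & 0 \\
a_{15} & a_{25} & a_{35} & a_{45} & a_{55} & a_{56} & 0 & 0 & 0 & 0 & 1 & 0 \\
a_{16} & a_{26} & a_{36} & a_{46} & a_{56} & a_{66} & 0 & 0 & 0 & 0 & 0 & 1
\end{array}
$$

Auxiliary Matrix

$$
\begin{array}{cccccc|cccccc}
a_{11} & a_{12} & a_{13} & a_{14} & a_{15} & a_{16} & 1 & & & & & \\
a_{12} & X_{22} & X_{23} & X_{24} & X_{25} & X_{26} & y_{21} & y_{22} & & & & \\
a_{13} & X_{32} & X_{33} & X_{34} & X_{35} & X_{36} & y_{31} & y_{32} & y_{33} & & & \\
a_{14} & X_{42} & X_{43} & X_{44} & X_{45} & X_{46} & y_{41} & y_{42} & y_{43} & y_{44} & & \\
a_{15} & X_{52} & X_{53} & X_{54} & X_{55} & X_{56} & y_{51} & y_{52} & y_{53} & y_{54} & y_{55} & \\
a_{16} & X_{62} & X_{63} & X_{64} & X_{65} & X_{66} & y_{61} & y_{62} & y_{63} & y_{64} & y_{65} & y_{66}
\end{array}
$$

b Matrix

$$
\begin{matrix}
b_{61} & b_{62} & b_{63} & b_{64} & b_{65} \\
b_{51} & b_{52} & b_{53} & b_{54} & b_{56} \\
b_{41} & b_{42} & b_{43} & b_{45} & b_{46} \\
b_{31} & b_{32} & b_{34} & b_{35} & b_{36} \\
b_{21} & b_{23} & b_{24} & b_{25} & b_{26} \\
b_{12} & b_{13} & b_{14} & b_{15} & b_{16}
\end{matrix}
$$

c Matrix

$$
\begin{matrix}
c_{11} & c_{12} & c_{13} & c_{14} & c_{15} & c_{16} \\
 & c_{22} & c_{23} & c_{24} & c_{25} & c_{26} \\
 & & c_{33} & c_{34} & c_{35} & c_{36} \\
 & & & c_{44} & c_{45} & c_{46} \\
 & & & & c_{55} & c_{56} \\
 & & & & & c_{66}
\end{matrix}
$$

r Matrix

$$
\begin{matrix}
r_{12} & r_{13} & r_{14} & r_{15} & r_{16} \\
 & r_{23} & r_{24} & r_{25} & r_{26} \\
 & & r_{34} & r_{35} & r_{36} \\
 & & & r_{45} & r_{46} \\
 & & & & r_{56}
\end{matrix}
$$

$$
R^2_{6.12345} = a_{16}X_{16} + X_{62}X_{26} + X_{63}X_{36} + X_{64}X_{46} + X_{65}X_{56}
$$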
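Calculation of Auxiliary Matrix


(1) Enter first row and column from original matrix.

(2)
$$
\begin{aligned}
X_{22} &= a_{22} - a_{12}a_{12} \\
X_{32} &= a_{23} - a_{13}a_{12}, \qquad X_{23} = X_{32}/X_{22} \\
X_{42} &= a_{24} - a_{14}a_{12}, \qquad X_{24} = X_{42}/X_{22} \\
X_{52} &= a_{25} - a_{15}a_{12}, \qquad X_{25} = X_{52}/X_{22} \\
X_{62} &= a_{26} - a_{16}a_{12}, \qquad X_{26} = X_{62}/X_{22} \\
y_{21} &= -a_{12}/X_{22} \\
y_{22} &= 1/X_{22}
\end{aligned}
$$

(3)
$$
\begin{aligned}
X_{33} &= a_{33} - a_{13}a_{13} - X_{32}X_{23} \\
X_{43} &= a_{34} - a_{14}a_{13} - X_{42}X_{23}, \qquad X_{34} = X_{43}/X_{33} \\
X_{53} &= a_{35} - a_{15}a_{13} - X_{52}X_{23}, \qquad X_{35} = X_{53}/X_{33} \\
X_{63} &= a_{36} - a_{16}a_{13} - X_{62}X_{23}, \qquad X_{36} = X_{63}/X_{33} \\
y_{31} &= (-a_{13} - X_{32}y_{21})/X_{33} \\
y_{32} &= (-X_{32}y_{22})/X_{33} \\
y_{33} &= 1/X_{33}
\end{aligned}
$$

(4)
$$
\begin{aligned}
X_{44} &= a_{44} - a_{14}a_{14} - X_{42}X_{24} - X_{43}X_{34} \\
X_{54} &= a_{45} - a_{15}a_{14} - X_{52}X_{24} - X_{53}X_{34}, \qquad X_{45} = X_{54}/X_{44} \\
X_{64} &= a_{46} - a_{16}a_{14} - X_{62}X_{24} - X_{63}X_{34}, \qquad X_{46} = X_{64}/X_{44} \\
y_{41} &= (-a_{14} - X_{42}y_{21} - X_{43}y_{31})/X_{44} \\
y_{42} &= (-X_{42}y_{22} - X_{43}y_{32})/X_{44} \\
y_{43} &= (-X_{43}y_{33})/X_{44} \\
y_{44} &= 1/X_{44}
\end{aligned}
$$

(5)
$$
\begin{aligned}
X_{55} &= a_{55} - a_{15}a_{15} - X_{52}X_{25} - X_{53}X_{35} - X_{54}X_{45} \\
X_{65} &= a_{56} - a_{16}a_{15} - X_{62}X_{25} - X_{63}X_{35} - X_{64}X_{45}, \qquad X_{56} = X_{65}/X_{55} \\
y_{51} &= (-a_{15} - X_{52}y_{21} - X_{53}y_{31} - X_{54}y_{41})/X_{55} \\
y_{52} &= (-X_{52}y_{22} - X_{53}y_{32} - X_{54}y_{42})/X_{55} \\
y_{53} &= (-X_{53}y_{33} - X_{54}y_{43})/X_{55} \\
y_{54} &= (-X_{54}y_{44})/X_{55} \\
y_{55} &= 1/X_{55}
\end{aligned}
$$

etc.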


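Calculation of c Matrix

(1)
$$
\begin{aligned}
c_{16} &= y_{61} \\
c_{15} &= y_{51} - X_{56}c_{16} \\
c_{14} &= y_{41} - X_{45}c_{15} - X_{46}c_{16} \\
c_{13} &= y_{31} - X_{34}c_{14} - X_{35}c_{15} - X_{36}c_{16} \\
c_{12} &= y_{21} - X_{23}c_{13} - X_{24}c_{14} - X_{25}c_{15} - X_{26}c_{16} \\
c_{11} &= 1 - a_{12}c_{12} - a_{13}c_{13} - a_{14}c_{14} - a_{15}c_{15} - a_{16}c_{16}
\end{aligned}
$$

(2)
$$
\begin{aligned}
c_{26} &= y_{62} \\
c_{25} &= y_{52} - X_{56}c_{26} \\
c_{24} &= y_{42} - X_{45}c_{25} - X_{46}c_{26} \\
c_{23} &= y_{32} - X_{34}c_{24} - X_{35}c_{25} - X_{36}c_{26} \\
c_{22} &= y_{22} - X_{23}c_{23} - X_{24}c_{24} - X_{25}c_{25} - X_{26}c_{26}
\end{aligned}
$$

(3)
$$
\begin{aligned}
c_{36} &= y_{63} \\
c_{35} &= y_{53} - X_{56}c_{36} \\
c_{34} &= y_{43} - X_{45}c_{35} - X_{46}c_{36}
\end{aligned}
$$

etc.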


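Calculation of b Matrix
General formula

$$
b_{ij} = -\frac{c_{ij}}{c_{ii}}
$$


(1) Calculate reciprocals for $c_{11}, c_{22}, \ldots, c_{66}$, giving $d_{11}, d_{22}, \ldots, d_{66}$.

(2)
$$
b_{61} = -d_{66}c_{16}, \quad b_{62} = -d_{66}c_{26}, \quad b_{63} = -d_{66}c_{36}, \quad b_{64} = -d_{66}c_{46}, \quad b_{65} = -d_{66}c_{56}
$$

(3)
$$
b_{51} = -d_{55}c_{15}, \quad b_{52} = -d_{55}c_{25}, \quad b_{53} = -d_{55}c_{35}, \quad b_{54} = -d_{55}c_{45}, \quad b_{56} = -d_{55}c_{56}
$$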


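Calculation of r Matrix
General formula.

$$
r_{ij} = -\frac{c_{ij}}{\sqrt{c_{ii}c_{jj}}}
$$

(1) Calculate

$$
g_{11} = \frac{1}{\sqrt{c_{11}}}, \quad g_{22} = \frac{1}{\sqrt{c_{22}}}, \quad \ldots, \quad g_{66} = \frac{1}{\sqrt{c_{66}}}
$$

(2) $r_{12} = -c_{12}g_{11}g_{22}$, $r_{13} = -c_{13}g_{11}g_{33}$, etc.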


EXAMPLE: CALCULATION OF GAUSS MULTIPLIERS, PARTIAL REGRESSION AND CORRELATION COEFFICIENTS,
AND MULTIPLE CORRELATION COEFFICIENT


Original Matrix

 1.0000  -0.4589  -0.5612  -0.3947  -0.3123   0.6412 |  1  0  0  0  0  0
-0.4589   1.0000   0.3114   0.0429   0.2861  -0.3190 |  0  1  0  0  0  0
-0.5612   0.3114   1.0000  -0.0655   0.1467  -0.4462 |  0  0  1  0  0  0
-0.3947   0.0429  -0.0655   1.0000   0.1882  -0.3511 |  0  0  0  1  0  0
-0.3123   0.2861   0.1467   0.1882   1.0000  -0.3092 |  0  0  0  0  1  0
 0.6412  -0.3190  -0.4462  -0.3511  -0.3092   1.0000 |  0  0  0  0  0  1

Auxiliary Matrix

 1.0000  -0.4589   -0.5612   -0.3947   -0.3123    0.6412  |  1
-0.4589   0.7894    0.06824  -0.1751    0.1809   -0.03135 |  0.5813   1.2668
-0.5612   0.05387   0.6814   -0.4074   -0.05622  -0.1243  |  0.7776  -0.1002   1.4676
-0.3947  -0.1382   -0.2776    0.7069    0.1051   -0.1935  |  0.9774   0.2084   0.5763   1.4146
-0.3123   0.1428   -0.03831   0.07433   0.8667   -0.1094  |  0.2151  -0.2310   0.01545 -0.1213   1.1538
 0.6412  -0.02475  -0.08467  -0.1368   -0.09486   0.5407  | -0.7525   0.05450  0.3784   0.3366   0.2023   1.8495

b Matrix

 0.4069  -0.0295  -0.2046  -0.1820  -0.1094
-0.1129   0.1913  -0.0483   0.0718  -0.1720
-0.5494  -0.1630  -0.4323   0.0568  -0.2261
-0.5757   0.0041  -0.3616  -0.0319  -0.2126
-0.4489   0.0054  -0.1794   0.1664  -0.0403
-0.2248  -0.3795  -0.3028  -0.0492   0.2787

c Matrix

 2.7004   0.6070   1.0247    0.8178   0.1328   -0.7525
          1.3522  -0.00724   0.2426  -0.2250    0.05450
                   1.7800    0.6435   0.05685   0.3784
                             1.4886  -0.08448   0.3366
                                      1.1759    0.2023
                                                1.8495

r Matrix

-0.318  -0.467  -0.408  -0.073   0.337
         0.005  -0.171   0.178  -0.034
                -0.395  -0.039  -0.209
                         0.064  -0.203
                                -0.137

R²(6.12345) = 0.6412² + (0.03135 × 0.02475) + (0.1243 × 0.08467)
              + (0.1935 × 0.1368) + (0.1094 × 0.09486) = 0.4593
R(6.12345) = 0.678
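The tabulated results can be checked by machine. The sketch below does not follow the auxiliary-matrix scheme of the text: it inverts the correlation matrix by ordinary Gauss-Jordan elimination with partial pivoting (a substituted technique, named here plainly) to obtain the c matrix, and then applies the general formulas for b and r. The multiple correlation is taken as R² = 1 - 1/c66, which follows from the formulas above because the sum defining R² equals 1 - X66 and c66 = y66 = 1/X66. All function and variable names are illustrative assumptions.

```python
import math


def invert(a):
    """Inverse of a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    # Augment each row with the corresponding row of the identity matrix.
    m = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


# Correlation matrix of the worked example.
corr = [
    [1.0000, -0.4589, -0.5612, -0.3947, -0.3123, 0.6412],
    [-0.4589, 1.0000, 0.3114, 0.0429, 0.2861, -0.3190],
    [-0.5612, 0.3114, 1.0000, -0.0655, 0.1467, -0.4462],
    [-0.3947, 0.0429, -0.0655, 1.0000, 0.1882, -0.3511],
    [-0.3123, 0.2861, 0.1467, 0.1882, 1.0000, -0.3092],
    [0.6412, -0.3190, -0.4462, -0.3511, -0.3092, 1.0000],
]

n = len(corr)
c = invert(corr)                      # Gauss multipliers (c matrix)

# b[i][j] = -c_ij / c_ii ; r[i][j] = -c_ij / sqrt(c_ii * c_jj) ; diagonals left as None.
b = [[None if i == j else -c[i][j] / c[i][i] for j in range(n)] for i in range(n)]
r = [[None if i == j else -c[i][j] / math.sqrt(c[i][i] * c[j][j]) for j in range(n)]
     for i in range(n)]

r2_6 = 1.0 - 1.0 / c[5][5]            # squared multiple correlation of variable 6
print("c11 =", round(c[0][0], 4), " c66 =", round(c[5][5], 4))
print("b61 =", round(b[5][0], 4), " r13 =", round(r[0][2], 3))
print("R^2 =", round(r2_6, 4), " R =", round(math.sqrt(r2_6), 3))
```

Indices in the code are zero-based, so `b[5][0]` corresponds to b61 and `r[0][2]` to r13. Small differences from the tabulated figures in the last decimal place are to be expected, since the hand calculation rounds at every step.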


Index 


Abbreviated Doolittle solution, 139 
Abnormality, flatness, 33 
measures of, 34, 42, 43 
peakedness, 34 
skewness, 33 
types of, 33 
Acceptance number, 428 
Accuracy in experimental design, 194 
Addition theorem, 305, 362 
Adjustments to treatment means, fac- 
torial experiments, 236, 239 
lattice experiments, 273, 279, 284, 
287 
Algebraic functions, 8 
Aliases in fractional replication, 247 
α3, α4 (alpha), binomial distribution, 42
normal distribution, 35 
Poisson distribution, 43 
Analysis, of balanced lattice, 274 
of covariance, 153 
application to the control of error, 
155 
calculation of sums of products, 
159 
corrections to means, 157 
example, 160 
principles of, 153 
standard errors of corrected means, 
157 
tests for heterogeneity of regres- 
sion, 158 
tests of significance in, 162 
of cross-over designs, 212 
of Latin squares, 203, 216 
of linear regression, 102 
of non-linear regression, 166
of non-orthogonal data, 304 
of quadruple lattice, 260, 275, 280 
of randomized blocks, 196, 214 
of split-plot experiments, 223 
of ungrouped experiments, 195 
Analysis, of variance, balanced lattice, 274
cross-over designs, 212 
fundamental principles, 64 
Latin squares, 203, 216 
limitations of, 97 
linear regression, 107 
non-linear regression, 167 
non-orthogonal data, 317 
quadruple lattice, 260, 275, 280 
randomized blocks, 196, 214 
test of significance in, 69 
three-fold classification, 94 
two-fold classification, 73 
ungrouped experiments, 196 
Arbitrary origin for calculation of 
standard deviation, 25 
Arithmetic mean, 16 


Balanced lattice, 261, 274 
analysis of, 274 
calculation of sums of squares, 274, 
285 
outline of analysis, 271 
variances for mean differences, 275, 
287 
variety corrections, 273, 275, 287 
Basic experimental designs, 192 
cross-over design, 212
Graeco-Latin square, 209 
Latin square, 203, 209, 216 
randomized blocks, 196, 214 
ungrouped randomized, 195
Bias, freedom from, in experimental 
design, 193 
test for, 47 
Binomial, α3, α4 for, 42
expansion, 7, 354 
in goodness of fit tests, 354 
in sequential sampling, 341 
Binomial distribution, 38 
Binomial sequential sampling, 430
Binomial theorem, 7
Biological assay, 394; see also Probit analysis

c2 (for average standard deviation), 419, 445
Calculation, of control limits, 420 
of correlation coefficient, 131 
of partial correlation coefficient, 145, 
454 
of partial regression coefficient, 136, 
454 
of regression coefficient, 113, 116 
of sample means for control charts, 
419 
of standard deviation, 24 
of standard deviations for control 
charts, 419 
of sums of products, 125 
in covariance analysis, 159 
of sums of squares, 82 
balanced lattice, 274, 285 
Chi-square (χ²), definition, 361
genetic ratios, 366
goodness of fit tests, 361
heterogeneity test, 362
ratios with multiple classification, 364
table of, 444
Chi-square (χ²) tests, for goodness of fit, genetic ratios, 366
multiple classification, 364
two or more classes, 359
twofold classification, 353
for independence, in r × c tables, 370
in r × 2 tables, 371, 375
in 2 × 2 tables, 372
Class intervals for frequency table, 32
Class values, selection of, 22
Coding, 105
Coefficient, of correlation, 125
of multiple correlation, 138, 144, 147, 150
of partial correlation, 135
of partial regression, 135
of variability, 25
Combinations, formulas for, 6


Complete confounding, 230, 237 
in a 2³ experiment, 230, 237
principle of, 230 
Completed values, 307, 316, 322 
Confounding, complete, 230, 237 
in a 3² factorial experiment, 250
in a 3³ factorial experiment, 253
in factorial experiments, 230
in 2ⁿ experiments, 243
partial, 235, 240 
Consumer’s risk, 428 
Control chart, 418 
calculations, 423 
control limits, for the mean, 419 
for the standard deviation, 420 
correction to standard deviation of 
sample mean, 419 
examples, 420, 422 
illustration, 418 
sample means, 419 
setting up, 418, 423 
standard deviation, 419 
summary of technique, 423 
Control of error by analysis of covari- 
ance, 155 
Corrections, for adjustment of average 
standard deviation, 445 
for block effects in confounded ex- 
periments, 236, 242 
for continuity in χ² tests, 358
Sheppard’s, 22 
to treatment means, in factorial ex- 
periments, 236, 239 
in lattice experiments, 266, 272, 
279, 284 
in standard deviation of sample 
means, 419 
Correlation, 122 
definition, 122 
measurement, 124 
positive and negative, 123 
scatter diagrams, 123 
surfaces, 123 
Correlation coefficient, 125 
calculation, 131 
fiducial limits, 131 
partial, 144 
range and interpretation, 126 
sampling distribution, 129 


Correlation coefficient, test of significance, 130
z transformation, 130 
Covariance analysis, 153 
applications in the control of error, 
155 
calculation of sums of products, 159 
correction to means, 157 
example, 160 
principles, 153 
standard errors of corrected means, 
157 
test for heterogeneity of regression, 
158, 163 
tests of significance, 162 
Covariance technique for estimating 
missing values, 324 
Covariation, 122 
Cross-over design, 212 
outline of analysis for, 212 


Cubic response, 167


Defining contrast, definition of, 247 
in fractional replication, 247 
Degrees of freedom, 20 
in tests of independence, 370 
individual, 87 
partitioning of, 71, 87 
Deletion of a variable in partial re- 
gression analysis, 142 
Derivation, of partial regression equa- 
tion, 135 
of regression equation, 102 
Determinants, 11 
evaluation of, by matrix calculations, 
451 
symmetrical, 452 
unsymmetrical, 451 
Differentiation of soil types by dis- 
criminant function, 380 
Discriminant function, 378 
applications, 385 
applied to differentiation of soil 
types, 380 
assigning weights, 387, 390 
example, 388 
plant selection, 385, 388 
principles, 385 
test of significance, 385 


Disproportionate sub-class numbers, 
326 
considerations arising out of, 328 
effect of presence or absence of in- 
teraction, 329 
methods of analysis, 333 
r × c table, by fitting constants, 339, 346
by weighted squares of means, 350
r × 2 table, by fitting constants, 336, 341
by weighted squares of means, 338, 345
2 × 2 table, by fitting constants, 333
by weighted squares of means, 335
multiple classification, 327 
single classification, 326 
Distribution, of means of samples, 37 
of tolerances, 394 
Distributions, binomial, 38 
normal, 29 
Poisson, 42 
Doolittle solution, abbreviated, 139 
Dosage-mortality experiments, 413 
Double sampling, 429 


Efficiency, in experimental design, 214 
in Latin square experiments, 217 
in randomized block experiments, 
200, 213 
Empirical probits, 399 
Enumeration data, 353 
Equations, logarithmic, 173 
polynomials, 173 
simultaneous, 9 
Error, experimental, 74 
selecting a valid, 90 
Error control, in covariance analysis, 
155 
selection of a variable for, 159 
Errors, first kind, 48 
second kind, 48 
Estimation of missing values, see Miss- 
ing values 
Evaluation, determinants, 11, 451 
simultaneous equations, 9, 452 
Exact test for independence in 2 × 2 table, 372
Experimental design, accuracy, 194 
efficiency, 213 
in probit analysis, 415 
Experimental designs, basic, 192 
cross-over design, 212 
Graeco-Latin square, 209 
Latin square, 203, 209, 216 
randomized blocks, 196, 203, 214 
ungrouped randomized, 195 
Experimental error, 74 


F, correction for lattice experiments, 
280 
table, 446 
test, 70, 76 
Factorial experiments, 220 
complete confounding in, 230, 237 
confounding in, 221, 230 
confounding in 2ⁿ experiments, 243
correction for block effects, 236
design and analysis of simple, 221
design for 3³ experiments, 253
fractional replication, 221, 246 
partial confounding, 235, 240 
principles, 220 
split plot, 222 
standard errors, 225 
2ⁿ designs in one replication, 245
Feeding trials, analysis of, 99 
Fiducial limits, correlation coefficients, 
131 
goodness of fit ratios, 358 
means, 59 
regression coefficients, 107, 114, 117
Fitting constants, analysis with disproportionate sub-class numbers,
r × c table, 339, 346
r × 2 table, 336, 341
2 × 2 table, 333
equations for, 313 
non-orthogonal data, 310 
Fitting curves, non-linear regression, 
172 
normal, 30 
reasons for, 172 
Fitting orthogonal polynomials, 181 
Fitting polynomial equations, 174 


Formulas, for α3, α4, binomial distribution, 42
normal distribution, 35 
Poisson distribution, 43 
for combinations, 6 
for permutations, 5 
for sequential sampling, binomial, 
431 
Poisson, 433 
Fractional replication, 246 
an alias in, 247 
confounding, 248 
defining contrast, 247 
outline of analysis, 249 
principal block in, 247 
principle of, 247 
Frequency distributions, see Distribu- 
tions 
Frequency polygon, 24 
Frequency table, 20 
graphical representation, 23 
setting up, 21 
Functions, algebraic, 8 


g1 = k3/(k2)^(3/2), g2 = k4/(k2)² (measures of abnormality), 36
standard errors of, 36 
Gauss multipliers, 139 
in matrix calculations, 454 
Genetic ratios, goodness of fit tests, 366 
Goodness of fit, 353 
multiple classification, 364 
9:3:3:1 ratio, 366 
two or more classes, 359 
χ² test of heterogeneity, 362
twofold classification, numbers mod- 
erate to large, 356 
numbers small, 353 
Graeco-Latin square, theory of, in ex- 
perimental design, 209 


Heterogeneity, of regression, in probit 
analysis, 401, 406, 411 
tests for, 158, 163 
of variation, 63 
Histogram, 23, 30 


Incomplete block experiments, 257 
an elementary type, 257 
lattice experiments, 259 
Incomplete block experiments, missing values, 300
principles, 257 
Incomplete blocks, meaning of, 257 
Independence, tests of, degrees of free- 
dom, 370 
r × c tables, 370
r × 2 tables, 371, 375
2 × 2 tables, 372
χ² test, 374
exact test, 372
nature of, 369 
Individual degrees of freedom partition- 
ing, 87 
Inspection, 418 
and verification, 418 
sampling for, 427 
Interaction, effect with disproportionate 
sub-class numbers, 329 
first order, 77 
higher order, 80 
Introductory concepts, 1 
Invariance, 157 


k1, k2, k3, k4 (k statistics), 35


Latin square experiments, 203 
analysis, 216 
efficiency, 204, 217 
estimation of missing values, 209 
outline of analysis, 210 
randomization, 205 
standard errors, 217 
Lattice designs, 259 
balanced, 261, 274, 285 
corrections to variety means, 266, 
272, 279, 284 
lattice squares, 288, 299 
partially balanced, 260, 275, 280 
standard errors, 270, 274, 280, 284, 
287 
theory, 262 
with groups repeated, 280 
with 2-p groups, 271 
Lattice experiments, 259 
balanced, 261, 274, 285 
corrections to variety means, 266, 
272, 279, 284 
diagrammatic representation of data 
from, 268 


Lattice experiments, partially balanced, 
260 
groups repeated, 280 
quadruple, 260, 275 
simple, 260 
triple, 260 
2 to p groups, 271 
standard errors, 270, 274, 280, 284, 
287 
Lattice squares, 288 
inaccuracies of weighting, 299 
type r = ½(p + 1), 288
type r = p + 1, 296
L.D. 50 (median lethal dose), 397 
Least squares method for deriving re- 
gression equations, 104 
Limitations of analysis of variance, 97 
Linear regression analysis, both vari- 
ables subject to error, 108 
calculation, regression table, 110 
ungrouped data, 109 
general observations, 102 
graphs, 103, 106, 110 
test, for deviations from, 111 
of significance, 107, 117 
Linear response, 167 


Mathematical concepts fundamental to 
statistics, 3 
Matrices, 13 
Matrix, 13 
Matrix calculations, 451 
evaluation of determinants, 451 
partial regression and correlation co- 
efficients, 454 
solution of simultaneous equations, 
452 
Mean, arithmetic, 16 
calculation of, 24 
geometric, 126 
of samples, distribution of, 37 
Mean square, 69 
Measurement data, 353 
Measures of abnormality, 34 
α3, α4, normal distribution, 35
binomial distribution, 42
Poisson distribution, 43
g1, g2, normal distribution, 36
Missing values, effect of, 305
estimation, 318 
covariance technique, 324 
incomplete block experiments, 300 
Latin square, 209 
lattice experiments, 300 
randomized blocks, 203 
r × c table, 318
more than one missing value, 320
one missing value, 318
2 × c table, 307
standard errors, for experiments with more than one missing value, 323
for experiments with one missing value, 320
Multiple correlation, 144 
calculation, 138, 144, 148 
coefficient, 138, 144, 147, 150 
interpretation, 146, 150 
significance, 150 
Multiple measurements in discriminant 
function, 378 
Multiple regression, 134 
analysis, 150 
equation, 135 
interpretation, 146, 150 
problem, 134 
special applications, 150 
test of significance, 150 


Non-linear regression, 166 
analysis of components, 166 
analysis of regression components, 
168, 171 
analysis of variance, 167 
basic factors, 171 
selecting a regression equation, 173 
Non-orthogonal data, analysis of vari- 
ance, 317 
treatment by fitting constants, 310 
2 x 2 table, 306 
Non-orthogonality, 304 
meaning, 304 
owing to disproportionate sub-class 
numbers, 326 
Normal curve, 30 
equation, 30 
fitting, 31 


Normal curve, histogram, 30 
table of ordinates, 442 
Normal distribution, 28 
α3, α4, 35
experimental derivation, 28 
histogram, 30 
Normal equations, linear regression, 
104 
non-linear regression, 175 
orthogonal polynomials, 176 
Null hypothesis, 59 


Objective in experimental designs, 194 
Ordinates of normal curve, table, 442 
use of, in fitting, 29 
Original values, 307, 316, 322 
Orthogonal polynomials, fitting by 
Fisher’s summation method, 181 
fitting with Fisher and Yates’ tables, 
186 
test of significance, 184 
Orthogonality, 70, 304 
effect of missing values, 305 


Paired values, regression analysis for, 
113 
regression table, 114 
t test, 57 
Parameters, 19 
Partial confounding, 235 
in a 2³ experiment, 235, 240
principle of, 235 
Partial correlation, 144 
coefficient, 144 
calculation, 145, 454 
Partial regression, 134 
analysis, 146 
coefficient, 135 
calculation, 136 
different sets of Y values, 138 
Gauss multipliers, 139, 454 
standard error, 142 
test of significance, 145 
deletion of a variable, 142 
methods, 134 
Partially balanced lattice, 260 
analysis, 271 
correction to F test, 280 
outline of analysis, 272 
quadruple, 260, 275, 280 
Partially balanced lattice, simple, 260
standard errors in, 280 
triple, 260 
variety corrections, 272 
Partitioning, of error components, 200 
of sums of squares and degrees of 
freedom, 71, 87 
Permutations, formulas for, 5 
Plant selection, example, 388 
use of discriminant function for, 385 
Poisson distribution, 42 
applications, 43 
verification trials, 436 
basic principles in verification trials, 
437 
formulas for α3, α4, 43
sequential sampling, 433 
Polynomial equations, 173, 175 
fitting, 174, 186 
example, 177, 181, 188 
Population, concept of, 15 
homogeneous, 65 
Principal block, definition, 248 
in fractional replication, 247 
methods of writing, 247 
Principles, covariance analysis, 153 
experimental design, 192 
Probability, basic rules, 4 
definition, 3 
of acceptance, 44 
of rejection, 44 
Probability integral, 30 
table of, 441 
Probit analysis, 394 
data appropriate for, 394 
determination of relative potency, 
408 
experimental design, 415 
graphical methods, 414 
heterogeneity in, 401, 406, 411 
practical applications, 397 
regression coefficient, 397 
fiducial limits, 401 
test of significance, 401 
Probit regression line, 398 
fitting a provisional, 398 
fitting by maximum likelihood, 404 
weighting coefficients, 398 
Probit transformation, 395 
Producer’s risk, 429 


Quadratic response, 167 
Quadruple lattice, 260, 275 
Quality control, 418 

definition, 418 

examples, 420, 422 

setting up control chart, 418, 423 
Quantity of information, 157 


Random numbers, 450 
Randomized block experiments, 196, 
214 
analysis, 196, 214 
efficiency, 200, 213 
estimation of missing values, 203 
outline of analysis, 196 
partitioning error components, 200 
standard errors, 197, 201, 215 
tests of significance, 215 
variance components, 198 
Range, table of range/standard devia- 
tion, 21 
Regression, deviations from linear, 106 
goodness of fit, 106, 398 
testing for heterogeneity, 158, 163 
Regression analysis, linear, 102 
general observations, 102 
paired values in regression table, 
114 
series of paired values, 113 
tests of significance, 107 
Regression coefficient, fiducial limits, 
107, 114, 117 
probit analysis, 397 
test of significance, 107, 401 
testing heterogeneity of series of, 158, 
163 
Regression components, 166 
analysis of, with Fisher and Yates’ 
tables, 167 
cubic and higher degree, 167 
example of analysis, 168 
linear, 167 
quadratic, 167 
Regression equations, derivation, 102, 
104 
Regression table, 118 
calculation, 112 
Relative deviate, 357 
Relative potency, determination in 
probit analysis, 408 




Response, cubic, 167 
linear, 167 
quadratic, 167 


Samples, 16 
Sampling, inspection, 426 
double, 429 
sequential, 430 
single, 427 
Sampling distribution, correlation co- 
efficient, 129 
Scope, sufficiency in experimental de- 
sign, 195 
Selecting a valid error, 75, 90 
Sequential sampling, 430 
binomial distribution, formulas, 431 
graphs, 435 
Poisson distribution, formulas, 433 
graphs, 435 
principle, 430 
Sheppard’s corrections, 22 
Significance, tests of, see Tests of sig- 
nificance 
Simple lattice, 260 
Simultaneous equations, evaluation by 
matrix calculations, 452
Single sampling, 427 
Skewness, 33 
Split plot experiments, 222 
example, 225 
outline of analysis, 223 
plan, 222 
standard errors, 224, 243 
sums of squares, 224 
Square yard plots, table of yields, 26 
Standard deviation, 17 
calculation, 24 
control charts, 419 
control limits, 420 
of a mean, 53 
sample mean, correction to, 419 
unbiased estimates of, 19 
Standard error, experiments with miss- 
ing values, 323 
Latin square experiments, 217 
lattice experiments, 270, 274, 280, 
284, 287 
of a mean, 20 
of estimated Y, 108 


Standard error, randomized block ex- 
periments, 197, 201, 215 
split plot experiments, 224, 243 
Standard regression coefficients, 144 
Statistic, definition of, 16 
Statistics, introductory concepts, 1 
Stirling's approximation, 361
Summation method of fitting orthogonal 
polynomials, 181 
Sums, of products, calculation, 125 
in covariance analysis, 159 
of squares, calculation, k groups, n not the same for each group, 82
of n variates each, 82 
(m X n)-fold table, 84 
non-orthogonal data, 316, 323 
split plot experiments, 224, 227 
three factor interaction, 85 
two factor interaction, 84 
(2 X n)-fold table, 82 
(2 X 2) table interaction, 78 


t, table of, 443 
t test, 50 
logic of, 47 
variances different, 58 
variates not paired, 55 
variates paired, 57 
Table, of c2 (for average standard deviation), 445
of chi-square (χ²), 444
of F, 446 
of ordinates of normal curve, 442 
of probability integral, 441 
of random numbers, 450 
of range/standard deviation, 21 
of t, 443 
Tests, of independence, degrees of free- 
dom, 370 
nature of, 369 
r × c tables, 370
r × 2 tables, 371, 375
2 × 2 tables, 372
χ² test, 372
exact test, 372 
of significance, analysis of covariance, 
153 
correlation coefficient, 130 
distribution of t, 50
Tests, of significance, heterogeneity of regression coefficients, 158, 401
logic of, 47 
mean difference, 50, 56 
multiple correlation coefficient, 150 
partial regression coefficient, 145 
probit analysis, 401 
randomized blocks, 215 
regression coefficient, 107, 401 
Theoretical frequency distributions, 27 
normal, 28 
derivation by sampling random 
numbers, 28 
equation, 30 
histogram, 30 
Theory of lattice designs, 262 
Threefold classification of variates, 94 
Tolerances, distribution of, 394 
Transformations, fractions of percent- 
ages, 98 
inverse sine, 98 
logarithmic, 98 
small whole numbers, 98 
square root, 98 
whole number counts with a wide 
range, 98 
Triple lattice, 260 
Twofold classification of variates, analy- 
sis of variance, 73 
goodness of fit tests, 353 
Types of abnormality, see Abnormality 
Ungrouped randomized experiments, 
195 
calculation of sums of squares, 195 
outline of analysis, 195 


Variance, analysis, 63 
limitations, 97 
test of significance, 69 
definition, 20 
for mean difference, balanced lat- 
tice, 275, 287 


Variance components, randomized block and ungrouped experiments, 198


Variates, continuous, 27 
discontinuous, 27 
threefold classification, 94 
twofold classification, 73 
Variation, 15 
assignable causes, 419 
heterogeneity of, 63 
Verification trials, application of Pois- 
son distribution, 436 
basic Poisson distribution for, 437 
definition, 436 
limits of accuracy, 439 


Weighted squares of means for analysis with disproportionate sub-class numbers,
r × c table, 344, 350
r × 2 table, 338
2 × 2 table, 335

Weighting coefficients in probit anal- 
ysis, 398 

Wireworm counts, analysis of, 101 

Working probit, 404 


z = ½ log_e (V1/V2), 69
test for differences between correla- 
tion coefficients, 131 
transformation for correlation co- 
efficient, 130 